Abstract

Graph Laplacians and related nonlinear mappings into low dimensional spaces have been shown 
to be powerful tools for organizing high dimensional data. Here we consider a data set X in 
which the graph associated with it changes depending on some set of parameters. We analyze 
this type of data in terms of the diffusion distance and the corresponding diffusion map. As the 
data changes over the parameter space, the low dimensional embedding changes as well. We give 
a way to go between these embeddings, and furthermore, map them all into a common space, 
allowing one to track the evolution of X in its intrinsic geometry. A global diffusion distance 
is also defined, which gives a measure of the global behavior of the data over the parameter 
space. Approximation theorems in terms of randomly sampled data are presented, as are potential 
applications. 

Keywords: diffusion distance; graph Laplacian; manifold learning; dynamic graphs; 
dimensionality reduction; kernel method; spectral graph theory 



1. Introduction 

In this paper we consider a changing graph depending on certain parameters, such as time, 
over a fixed set of data points. Given a set of parameters of interest, our goal is to organize the 
data in such a way that we can perform meaningful comparisons between data points derived 
from different parameters. In some scenarios, a direct comparison may be possible; on the other 
hand, the methods we develop are more general and can handle situations in which the changes to 
the data prevent direct comparisons across the parameter space. In particular, one may consider 
situations in which the fundamental building blocks of the data set change, perhaps changing the 
dimension of the data. In order to make meaningful comparisons between different realizations 
of the data, we look for invariants in the data set as it changes. We model the data set as a
normalized, weighted graph, and measure the similarity between two points based on how the
local subgraph around each point changes over the parameter space. The framework we develop 
will allow for the comparison of any two points derived from any two parameters within the 
graph, thus allowing one to organize not only along the data points but the parameter space as 
well. 

An example of this type of data comes from hyperspectral image analysis. A hyperspectral 
image is in fact a set of images of the same scene that are taken at different wavelengths. Put 
together, these images form a data cube in which the length and width of the cube correspond 
to spatial dimensions, and the height of the cube corresponds to the different wavelengths. Thus 
each pixel is in fact a vector corresponding to the spectral signature of the materials contained 
in that pixel. Consider the situation in which we are given two hyperspectral images of the 
same scene, and we wish to highlight the anomalous (e.g., man made) changes between the 
two. Perhaps though, for each data set, different cameras were used which measured different 
wavelengths, perhaps also at different times of day under different weather conditions. In such a 
scenario a direct comparison becomes much more difficult. Current work in the field often
goes under the heading of change detection, as the goal is often to find small changes in a large
scene; see [1] for more details.

Other possible areas for applications come from the modeling of social networks as graphs. 
The relationships between people change over time and determining how groups of people inter- 
act and evolve is a new and interesting problem that has usefulness in marketing and other areas. 
Financial markets are yet another area that lends itself to analysis conducted over time, as are 
certain evolutionary biological questions and even medical problems in which patient tests are 
updated over the course of their lives. 

Let $\mathcal{I}$ denote our parameter space, and let $X_\alpha$, with $\alpha \in \mathcal{I}$, be the data in question. The
elements of our data set are fixed, but the graph changes depending on the parameter $\alpha$. In other
words, there is a known bijection between $X_\alpha$ and $X_\beta$ for $\alpha, \beta \in \mathcal{I}$, but the corresponding graph
weights of $X$ have changed between the two parameters. For a fixed $\alpha$, the diffusion maps
framework developed in [2] gives a multiscale way of organizing $X_\alpha$. More specifically, the diffusion
mapping maps $X_\alpha$ into a particular $\ell^2$ space in which the usual $\ell^2$ distance corresponds to the
diffusion distance on $X_\alpha$. However, for different parameters $\alpha$ and $\beta$, the diffusion map may take $X_\alpha$
and $X_\beta$ into different $\ell^2$ spaces, meaning that one cannot take the standard $\ell^2$ distance
between the elements of these two spaces. Our contribution here is to generalize the diffusion maps
framework so that it works independently of the parameter $\alpha$. In particular, we derive formulas
for the distance between points in different embeddings that are in terms of the individual
diffusion maps of each space. It is even possible to define a mapping from one embedding to the other,
so that after applying this mapping the standard $\ell^2$ distance can once again be used to compute
diffusion distances. Once this generalized framework has been established, we are able to define
a global distance between all of $X_\alpha$ and $X_\beta$ based on the behavior of the diffusions within each
data set. This distance in turn allows one to model the global behavior of $X_\alpha$ as it changes over
$\mathcal{I}$.

Earlier results that use diffusion maps to compare two data sets can be found in [3].
Furthermore, there is recent work contained in [4] that also involves combining diffusion
geometry principles via tree structures with evolving graphs. In [5], the author considers the case
of an evolving Riemannian manifold on which a diffusion process is spreading as the manifold
evolves. In our work, we separate out the two processes, effectively using the diffusion process to
organize the evolution of the data. Also tangentially related to this work are the results contained
in [6] on shape analysis, in which shapes are compared via their heat kernels. More generally,
this paper fits into the larger class of research that utilizes nonlinear mappings into low
dimensional spaces in order to organize potentially high dimensional data; examples include locally
linear embedding (LLE) [7], ISOMAP [8], Hessian LLE [9], Laplacian eigenmaps [10], and the
aforementioned diffusion maps [2].

The paper is organized as follows. In the next section, we fix some notation
and review the diffusion mapping first presented in [2]. In Section 3 we generalize the diffusion
distance for a data set that changes over some parameter space, and show that it too can be
computed in terms of the spectral embeddings of the corresponding diffusion operators. We also
show how to map each of the embeddings into one common embedding in which the $\ell^2$ distance is
equal to the diffusion distance. The global diffusion distance between graphs is defined in Section
4; it can also be computed in terms of the eigenvalues and eigenfunctions of the
relevant diffusion operators. In Section 5 we set up and state two random sampling theorems, one
for the diffusion distance and one for the global diffusion distance. The proofs of these theorems
are given in Appendix B. Section 6 contains some applications, and we conclude with some
remarks and possible future directions in Section 7.



2. Notation and preliminaries 

In this section we introduce some basic notation and review certain preliminary results that 
will motivate our work. 



2.1. Notation 

Let $\mathbb{R}$ denote the real numbers and let $\mathbb{N} = \{1, 2, 3, \ldots\}$. Often we will use constants that
depend on certain variables or parameters. We let $C(\cdot)$, $C_1(\cdot)$, $C_2(\cdot)$, etc., denote these constants;
note that they can change from line to line.

We recall some basic notation from operator theory. Let $\mathcal{H}$ be a real, separable Hilbert space
with scalar product $\langle \cdot, \cdot \rangle$ and norm $\| \cdot \|$. Let $A : \mathcal{H} \to \mathcal{H}$ be a bounded, linear operator, and let
$A^*$ be its adjoint. The operator norm of $A$ is defined as:

$$ \|A\| = \sup_{\|f\| = 1} \|Af\|. $$

A bounded operator $A$ is Hilbert-Schmidt if

$$ \sum_{i \geq 1} \|A e^{(i)}\|^2 < \infty $$

for some (and hence any) Hilbert basis $\{e^{(i)}\}_{i \geq 1}$. The space of Hilbert-Schmidt operators is also a
Hilbert space with scalar product

$$ \langle A, B \rangle_{HS} \triangleq \sum_{i \geq 1} \langle A e^{(i)}, B e^{(i)} \rangle. $$

We denote the corresponding norm as $\| \cdot \|_{HS}$. Note that if an operator is Hilbert-Schmidt, then it
is compact. A subset of the Hilbert-Schmidt operators are those that are trace class. A bounded
operator $A$ is trace class if

$$ \sum_{i \geq 1} \langle \sqrt{A^* A}\, e^{(i)}, e^{(i)} \rangle < \infty $$

for some (and hence any) Hilbert basis $\{e^{(i)}\}_{i \geq 1}$. For any trace class operator $A$, we have

$$ \mathrm{Tr}(A) = \sum_{i \geq 1} \langle A e^{(i)}, e^{(i)} \rangle < \infty, $$

where $\mathrm{Tr}(A)$ is called the trace of $A$. The space of trace class operators is a Banach space endowed
with the norm

$$ \|A\|_{TC} = \mathrm{Tr}(\sqrt{A^* A}). $$

Note that the different operator norms are related as follows:

$$ \|A\| \leq \|A\|_{HS} \leq \|A\|_{TC}. $$
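
In finite dimensions the three norms above can be read off the singular values of a matrix, which makes the chain of inequalities easy to check numerically. The following is only an illustrative sketch: the random matrix `A` and its size are assumptions made for the example, and a matrix stands in for a bounded operator on a separable Hilbert space.

```python
import numpy as np

# Hypothetical example operator: a random 6 x 6 real matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

# The singular values of A are the eigenvalues of sqrt(A^* A).
s = np.linalg.svd(A, compute_uv=False)

op_norm = s.max()                  # ||A||    : largest singular value
hs_norm = np.sqrt(np.sum(s ** 2))  # ||A||_HS : square root of sum of squares
tc_norm = np.sum(s)                # ||A||_TC : Tr(sqrt(A^* A))

print(op_norm <= hs_norm <= tc_norm)
```

The Hilbert-Schmidt norm coincides with the Frobenius norm of the matrix, and the trace class norm with its nuclear norm.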

2.2. Diffusion maps 

In this section we consider just a single data set that does not change with some parameter
and review the notion of diffusion maps on this data set. We assume that we are given a measure
space $(X, \mu)$, consisting of data points $X$ that are distributed according to $\mu$. The diffusion maps
framework developed in [2] gives a multiscale organization of the data set $X$. We also have a
positive, symmetric kernel $k : X \times X \to \mathbb{R}$ that encodes how similar two data points are. From
$X$ and $k$, one can construct a weighted graph $\Gamma = (X, k)$, in which the vertices of $\Gamma$ are the data
points $x \in X$, and the weight of the edge $xy$ is given by $k(x, y)$.

Define the density, $m : X \to \mathbb{R}$, as

$$ m(x) \triangleq \int_X k(x, y) \, d\mu(y), \quad \text{for all } x \in X. \tag{1} $$

We assume that the density $m$ satisfies

$$ m(x) > 0, \quad \text{for } \mu \text{ a.e. } x \in X, \tag{2} $$

and

$$ m \in L^1(X, \mu). \tag{3} $$

Given (2), the weight function

$$ p(x, y) \triangleq \frac{k(x, y)}{m(x)} $$

is well defined for $\mu \otimes \mu$ a.e. $(x, y) \in X \times X$. Although $p$ is no longer symmetric, it does satisfy
the following useful property:

$$ \int_X p(x, y) \, d\mu(y) = 1, \quad \text{for } \mu \text{ a.e. } x \in X. $$

Therefore we can view $p$ as the transition kernel of a Markov chain on $X$. Equivalently, if
$p \in L^2(X \times X, \mu \otimes \mu)$, the integral operator $P : L^2(X, \mu) \to L^2(X, \mu)$, defined as

$$ (Pf)(x) \triangleq \int_X p(x, y) f(y) \, d\mu(y), \quad \text{for all } f \in L^2(X, \mu), $$

is a diffusion operator. In particular, the value $p(x, y)$ represents the probability of transition in
one time step from the vertex $x$ to the vertex $y$, which is proportional to the edge weight $k(x, y)$.
For $t \in \mathbb{N}$, let $p^{(t)}(x, y)$ represent the probability of transition in $t$ time steps from the node $x$ to
the node $y$; note that $p^{(t)}$ is the kernel of the operator $P^t$. As shown in [2], running the Markov
chain forward, or equivalently taking powers of $P$, reveals relevant geometric structures of $X$ at
different scales. In particular, small powers of $P$ will segment the data set into several smaller
clusters. As $t$ is increased and the Markov chain diffuses across the graph $\Gamma$, the clusters evolve
and merge together until in the limit as $t \to \infty$ the data set is grouped into one cluster (assuming
the graph is connected).
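
As a concrete illustration of the construction above, the sketch below builds the density, the transition kernel, and a power of the diffusion operator for a finite point set. It rests on assumptions that are not part of the text: the data are 50 random points in the unit square, $\mu$ is the counting measure (so integrals become sums), and $k$ is a Gaussian kernel with a hypothetical scale `eps`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))      # 50 illustrative data points in the unit square

# Gaussian kernel k(x, y) = exp(-|x - y|^2 / eps): positive and symmetric.
eps = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / eps)

# With the counting measure, the integral defining the density is a row sum.
m = K.sum(axis=1)                  # m(x) = sum_y k(x, y)
P = K / m[:, None]                 # p(x, y) = k(x, y) / m(x)

# p^(t) is the kernel of P^t, i.e. the t-step transition probabilities.
P_t = np.linalg.matrix_power(P, 8)

print(np.allclose(P.sum(axis=1), 1.0), np.allclose(P_t.sum(axis=1), 1.0))
```

Each row of `P` and of `P_t` sums to one, which is the discrete form of the normalization property of $p$.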

The phenomenon described above can be encapsulated by the diffusion distance at time $t$
between two vertices $x$ and $y$ in the graph $\Gamma$. In order to define the diffusion distance, we first
note that the Markov chain constructed above has the stationary distribution $\pi : X \to \mathbb{R}$, where

$$ \pi(x) = \frac{m(x)}{\int_X m(y) \, d\mu(y)}. $$

Combining (2) and (3) we see that $\pi(x)$ is well defined for $\mu$ a.e. $x \in X$. The diffusion distance
between $x, y \in X$ is then defined as:

$$ \mathcal{D}^{(t)}(x, y)^2 \triangleq \left\| p^{(t)}(x, \cdot) - p^{(t)}(y, \cdot) \right\|^2_{L^2(X, d\mu/\pi)} = \int_X \left( p^{(t)}(x, u) - p^{(t)}(y, u) \right)^2 \frac{d\mu(u)}{\pi(u)}. $$

A simplified formula for the diffusion distance can be found by considering the spectral
decomposition of $P$. Define the kernel $a : X \times X \to \mathbb{R}$ as

$$ a(x, y) \triangleq \frac{\sqrt{m(x)}}{\sqrt{m(y)}} \, p(x, y) = \frac{k(x, y)}{\sqrt{m(x)} \sqrt{m(y)}}, \quad \text{for } \mu \otimes \mu \text{ a.e. } (x, y) \in X \times X. $$

If $a \in L^2(X \times X, \mu \otimes \mu)$, then $P$ has a discrete set of eigenfunctions $\{\upsilon^{(i)}\}_{i \geq 1}$ with corresponding
eigenvalues $\{\lambda^{(i)}\}_{i \geq 1}$. It can then be shown that

$$ \mathcal{D}^{(t)}(x, y)^2 = \sum_{i \geq 1} \left( \lambda^{(i)} \right)^{2t} \left( \upsilon^{(i)}(x) - \upsilon^{(i)}(y) \right)^2. \tag{4} $$

Inspired by (4), [2] defines the diffusion map $\Upsilon^{(t)} : X \to \ell^2$ at diffusion time $t$ to be:

$$ \Upsilon^{(t)}(x) \triangleq \left( (\lambda^{(i)})^t \, \upsilon^{(i)}(x) \right)_{i \geq 1}. $$

Therefore, the diffusion distance at time $t$ between $x, y \in X$ is equal to the $\ell^2$ norm of the
difference between $\Upsilon^{(t)}(x)$ and $\Upsilon^{(t)}(y)$:

$$ \mathcal{D}^{(t)}(x, y) = \left\| \Upsilon^{(t)}(x) - \Upsilon^{(t)}(y) \right\|_{\ell^2}. $$

One can also define a second diffusion distance in terms of the symmetric kernel $a$ as opposed
to the asymmetric kernel $p$. In particular, define the operator $A : L^2(X, \mu) \to L^2(X, \mu)$ as

$$ (Af)(x) \triangleq \int_X a(x, y) f(y) \, d\mu(y), \quad \text{for all } f \in L^2(X, \mu). $$

Like the diffusion operator $P$, the operator $A$ and its powers, $A^t$, reveal the relevant geometric
structures of the data set $X$. Letting $a^{(t)} : X \times X \to \mathbb{R}$ denote the kernel of the operator $A^t$, we can
define another diffusion distance $D^{(t)} : X \times X \to \mathbb{R}$ as follows:

$$ D^{(t)}(x, y)^2 \triangleq \left\| a^{(t)}(x, \cdot) - a^{(t)}(y, \cdot) \right\|^2_{L^2(X, \mu)} = \int_X \left( a^{(t)}(x, u) - a^{(t)}(y, u) \right)^2 d\mu(u). $$

As before, we consider the spectral decomposition of $A$. Let $\{\lambda^{(i)}\}_{i \geq 1}$ and $\{\psi^{(i)}\}_{i \geq 1}$ denote the
eigenvalues and eigenfunctions of $A$ (indeed, the nonzero eigenvalues of $P$ and $A$ are the same),
and define the diffusion map $\Psi^{(t)} : X \to \ell^2$ (corresponding to $A$) as

$$ \Psi^{(t)}(x) \triangleq \left( (\lambda^{(i)})^t \, \psi^{(i)}(x) \right)_{i \geq 1}. $$

Then, under the same assumptions as before, we have

$$ D^{(t)}(x, y)^2 = \left\| \Psi^{(t)}(x) - \Psi^{(t)}(y) \right\|^2_{\ell^2} = \sum_{i \geq 1} \left( \lambda^{(i)} \right)^{2t} \left( \psi^{(i)}(x) - \psi^{(i)}(y) \right)^2. \tag{5} $$

We make a few remarks concerning the differences between the two formulations. First, we
note that the original diffusion distance $\mathcal{D}^{(t)}$ is defined as an $L^2$ distance under the weighted
measure $d\mu/\pi$. The second diffusion distance, $D^{(t)}$, due to the symmetric normalization built into the
kernel $a$, is defined only in terms of the underlying measure $\mu$. Furthermore, the eigenfunctions
of $A$ are orthogonal, unlike the eigenfunctions of $P$. Finally, as we have already noted, the
eigenvalues of $P$ and $A$ are in fact the same, and furthermore they are contained in $(-1, 1]$. If the graph
$\Gamma$ is connected, then the eigenfunction of $P$ with eigenvalue one is simply the function that maps
every element of $X$ to one. The corresponding eigenfunction of $A$ though is the square root of the
density, i.e., $\sqrt{m(x)}$. Thus, while both versions of the diffusion distance merge smaller clusters
into large clusters as $t$ grows, $\mathcal{D}^{(t)}$ will merge every data point into the same cluster in the limit
as $t \to \infty$, while $D^{(t)}$ will reflect the behavior of the density $m$ in the limit as $t \to \infty$.
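
The identity (5) for the symmetric formulation can be checked numerically on a finite data set. The sketch below is an illustration under assumptions not made in the text: $\mu$ is the counting measure, the kernel is Gaussian, and the helper name `diffusion_map` is hypothetical. It compares the $L^2$ distance between two rows of the kernel of $A^t$ with the $\ell^2$ distance between the corresponding diffusion map images.

```python
import numpy as np

def diffusion_map(K, t):
    """Return (A, lam, psi, Psi) for the symmetric kernel a built from K."""
    m = K.sum(axis=1)                     # density under the counting measure
    A = K / np.sqrt(np.outer(m, m))       # a(x, y) = k(x, y) / sqrt(m(x) m(y))
    lam, psi = np.linalg.eigh(A)          # A is symmetric, so eigh applies
    order = np.argsort(lam)[::-1]         # sort eigenvalues in descending order
    lam, psi = lam[order], psi[:, order]
    Psi = psi * lam ** t                  # row x holds Psi^(t)(x)
    return A, lam, psi, Psi

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))             # illustrative data
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 0.5)

t, x, y = 3, 0, 7
A, lam, psi, Psi = diffusion_map(K, t)

# Left side of (5): L^2 distance between rows of the kernel of A^t.
A_t = np.linalg.matrix_power(A, t)
d2_direct = np.sum((A_t[x] - A_t[y]) ** 2)

# Right side of (5): squared l^2 distance between diffusion map images.
d2_spectral = np.sum((Psi[x] - Psi[y]) ** 2)

print(np.isclose(d2_direct, d2_spectral, atol=1e-12))
```

In this finite setting the top eigenvalue is one and its eigenvector is proportional to $\sqrt{m}$, in line with the remarks above.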

3. Generalizing the diffusion distance for changing data 

In this section we generalize the diffusion maps framework for data sets with input parame- 
ters. 

3.1. Defining the diffusion distance on a family of graphs 

We now turn our attention to the original problem introduced at the beginning of this paper.
In its most general form, we are given a parameter space $\mathcal{I}$ and a data set $X_\alpha$ that depends on
$\alpha \in \mathcal{I}$. The data points of $X_\alpha$ are given by $x_\alpha$. The parameter space $\mathcal{I}$ can be continuous, discrete,
or completely arbitrary. Since there is an obvious and a priori known bijective correspondence
between $X_\alpha$ and $X_\beta$ for any $\alpha, \beta \in \mathcal{I}$, we consider the following model throughout the remainder
of this paper. We are given a single measure space $(X, \mu)$ that we think of as changing over $\mathcal{I}$. The
measure $\mu$ here represents some underlying distribution of the points in $X$ that does not change
over $\mathcal{I}$. The evolution of $(X, \mu)$ is completely encoded by a family of kernels $k_\alpha : X \times X \to \mathbb{R}$,
which measure the similarity between two data points $x, y \in X$ for the parameter $\alpha \in \mathcal{I}$.
Our goal is to reveal the relevant geometric structures of $X$ across the entire parameter space
$\mathcal{I}$, and to furthermore have a way of comparing structures from one parameter to other structures
derived from a second parameter. To do so, we shall generalize the diffusion distance so that we
can compare diffusions derived from different parameters. The first step is to once again consider
each pairing $X$ and $k_\alpha$ as a weighted graph, which we denote as $\Gamma_\alpha \triangleq (X, k_\alpha)$.

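Updating our notation for this dynamic setting, for each parameter $\alpha \in \mathcal{I}$ we have the density
$m_\alpha : X \to \mathbb{R}$ defined as

$$ m_\alpha(x) \triangleq \int_X k_\alpha(x, y) \, d\mu(y), \quad \text{for all } \alpha \in \mathcal{I}, \ x \in X. $$

For reasons that shall become clear later, we slightly strengthen the assumptions on $m_\alpha$ as
compared to those in equations (2) and (3). In particular, we assume that

$$ m_\alpha(x) > 0, \quad \text{for all } \alpha \in \mathcal{I}, \ x \in X, $$

and

$$ m_\alpha \in L^1(X, \mu), \quad \text{for all } \alpha \in \mathcal{I}. $$

We then define two classes of kernels $a_\alpha : X \times X \to \mathbb{R}$ and $p_\alpha : X \times X \to \mathbb{R}$ in the same manner
as earlier:

$$ a_\alpha(x, y) \triangleq \frac{k_\alpha(x, y)}{\sqrt{m_\alpha(x)} \sqrt{m_\alpha(y)}}, \quad \text{for all } \alpha \in \mathcal{I}, \ (x, y) \in X \times X, \tag{6} $$

and

$$ p_\alpha(x, y) \triangleq \frac{k_\alpha(x, y)}{m_\alpha(x)}, \quad \text{for all } \alpha \in \mathcal{I}, \ (x, y) \in X \times X. $$

Assume that $a_\alpha, p_\alpha \in L^2(X \times X, \mu \otimes \mu)$. Their corresponding integral operators are given by
$A_\alpha : L^2(X, \mu) \to L^2(X, \mu)$ and $P_\alpha : L^2(X, \mu) \to L^2(X, \mu)$, where

$$ (A_\alpha f)(x) \triangleq \int_X a_\alpha(x, y) f(y) \, d\mu(y), \quad \text{for all } \alpha \in \mathcal{I}, \ f \in L^2(X, \mu), \tag{7} $$

and

$$ (P_\alpha f)(x) \triangleq \int_X p_\alpha(x, y) f(y) \, d\mu(y), \quad \text{for all } \alpha \in \mathcal{I}, \ f \in L^2(X, \mu). $$

Finally, we let $a_\alpha^{(t)}$ and $p_\alpha^{(t)}$ denote the kernels of the integral operators $A_\alpha^t$ and $P_\alpha^t$, respectively.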

Returning to the task at hand, in order to compare $\Gamma_\alpha$ with $\Gamma_\beta$, it is possible to use the operators
$A_\alpha$ and $A_\beta$ or $P_\alpha$ and $P_\beta$. We choose to perform our analysis using the symmetric operators, as
it shall simplify certain things. For now, consider the function $a_\alpha(x, \cdot)$ for a fixed $x \in X$. We
think of this function in the following way. Consider the graph $\Gamma_\alpha$, and imagine dropping a unit
of mass on the node $x$ and allowing it to spread, or diffuse, throughout $\Gamma_\alpha$. After one unit of
time, the amount of mass that has spread from $x$ to some other node $y$ is proportional to $a_\alpha(x, y)$.
Similarly, if we want to let the mass spread throughout the graph for a longer period of time, we
can, and the amount of mass that has spread from $x$ to $y$ after $t$ units of time is then proportional
to $a_\alpha^{(t)}(x, y)$. The diffusion distance at time $t$, which is the $L^2$ norm of $a_\alpha^{(t)}(x, \cdot) - a_\alpha^{(t)}(y, \cdot)$, is then
comparing the behavior of the diffusion centered at $x$ with the behavior of the diffusion centered
at $y$. We wish to extend this idea for different parameters $\alpha$ and $\beta$. In other words, we wish to
have a meaningful distance between $x$ at parameter $\alpha$ and $y$ at parameter $\beta$ that is based on the
same principle of measuring how their respective diffusions behave.

Our solution is to generalize the diffusion distance in the following way. For each diffusion
time $t \in \mathbb{N}$, we define a dynamic diffusion distance $D^{(t)} : (X \times \mathcal{I}) \times (X \times \mathcal{I}) \to \mathbb{R}$ as follows. Let
$x_\alpha \triangleq (x, \alpha) \in X \times \mathcal{I}$, and set

$$ D^{(t)}(x_\alpha, y_\beta)^2 \triangleq \left\| a_\alpha^{(t)}(x, \cdot) - a_\beta^{(t)}(y, \cdot) \right\|^2_{L^2(X, \mu)} = \int_X \left( a_\alpha^{(t)}(x, u) - a_\beta^{(t)}(y, u) \right)^2 d\mu(u). $$

This notion of distance can be thought of as comparing how the neighborhood of $x_\alpha$ differs from
the neighborhood of $y_\beta$. In particular, if we are comparing the same data point but at different
parameters, for example $x_\alpha$ and $x_\beta$, the diffusion distance between them will be small if their
neighborhoods do not change much from $\alpha$ to $\beta$. On the other hand, if say a large change occurs
at $x$ at parameter $\beta$, then the neighborhood of $x_\beta$ should differ from the neighborhood of $x_\alpha$ and
so they will have a large diffusion distance between them.

Some more intuition about the quantity $D^{(t)}(x_\alpha, y_\beta)$ can be derived from the triangle
inequality. In particular, one application of it gives

$$ D^{(t)}(x_\alpha, y_\beta) \leq D^{(t)}(x_\alpha, x_\beta) + D^{(t)}(x_\beta, y_\beta). $$

Thus we see that $D^{(t)}(x_\alpha, y_\beta)$ is bounded from above by the change in $x$ from $\alpha$ to $\beta$ (i.e. the
quantity $D^{(t)}(x_\alpha, x_\beta)$) plus the diffusion distance between $x$ and $y$ in the graph $\Gamma_\beta$ (i.e. the quantity
$D^{(t)}(x_\beta, y_\beta)$).

Remark 3.1. As noted earlier, we have chosen to generalize the diffusion distance in terms of
the symmetric kernels $a_\alpha$ as opposed to the asymmetric kernels $p_\alpha$. The primary reason for
this choice is that when using the kernel $p_\alpha$ to compute the diffusion distance between $x$ and
$y$, we must use the weighted measure $d\mu/\pi_\alpha$, where $\pi_\alpha$ denotes the stationary distribution of
the Markov chain on $\Gamma_\alpha$. Thus, when computing the diffusion distance between $x_\alpha$ and $y_\beta$, one
must incorporate this weighted measure as well. Since the stationary distribution will invariably
change from $\alpha$ to $\beta$, the most natural generalization in this case would be:



3.2. Diffusion maps for $\{\Gamma_\alpha\}_{\alpha \in \mathcal{I}}$

Analogous to the diffusion distance for a single graph $\Gamma = (X, k)$, we can write the
diffusion distance for $\{\Gamma_\alpha\}_{\alpha \in \mathcal{I}}$ in terms of the spectral decompositions of $\{A_\alpha\}_{\alpha \in \mathcal{I}}$. We first collect the
following mild, but necessary, assumptions, some of which have already been stated.

Assumption 1. We assume the following properties:

1. $X \subseteq \mathbb{R}^d$ and $\mu$ is a $\sigma$-finite measure.

2. The kernel $k_\alpha$ is positive definite and symmetric for all $\alpha \in \mathcal{I}$.

3. For each $\alpha \in \mathcal{I}$, $m_\alpha \in L^1(X, \mu)$ and $m_\alpha > 0$.

4. For any $\alpha \in \mathcal{I}$, the operator $A_\alpha$ is trace class.

A few remarks concerning the assumed properties. First, the reader may have noticed that
we replaced the assumption that $k_\alpha$ be positive with the stronger assumption that it is positive
definite. This, combined with the third property that $m_\alpha(x) > 0$ for all $x \in X$, implies that $a_\alpha$ is
also positive definite. Thus the operators $A_\alpha$ are positive and self adjoint.

If one wished to revert back to the weaker assumption that $k_\alpha$ merely be positive, then the
following adjustment could be made. Clearly the symmetrically normalized kernel $a_\alpha$ will still
be positive, but the operator $A_\alpha$ may not be. However, one could replace $A_\alpha$, for each $\alpha \in \mathcal{I}$, with
the graph Laplacian $L_\alpha : L^2(X, \mu) \to L^2(X, \mu)$, which is defined as

$$ L_\alpha \triangleq \frac{1}{2} \left( I - A_\alpha \right), $$

where $I : L^2(X, \mu) \to L^2(X, \mu)$ is the identity operator. The graph Laplacian $L_\alpha$ is a positive
operator with eigenvalues contained in $[0, 1]$. The analysis that follows would still apply with
only minor adjustments.

The fourth item, that $A_\alpha$ be trace class, plays a key role in the results of this section, and
itself implies that these operators are Hilbert-Schmidt and so also compact. Thus, as a further
consequence, $a_\alpha \in L^2(X \times X, \mu \otimes \mu)$ for each $\alpha \in \mathcal{I}$.

We note that assumptions three and four are both satisfied if for each $\alpha \in \mathcal{I}$ the kernel $k_\alpha$ is
continuous, bounded from above and below, and if the measure of $X$ is finite. That is, if for each
$\alpha$,

$$ 0 < C_1(\alpha) \leq k_\alpha(x, y) \leq C_2(\alpha) < \infty, \quad \text{for all } (x, y) \in X \times X, $$

and

$$ \mu(X) < \infty, $$

then we can derive assumptions three and four.

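As an immediate consequence of the properties contained in Assumption 1, we see that for
each $\alpha$ the operator $A_\alpha$ has a countable collection of positive eigenvalues and orthonormal
eigenfunctions. Let $\{\lambda_\alpha^{(i)}\}_{i \geq 1}$ and $\{\psi_\alpha^{(i)}\}_{i \geq 1}$ be the eigenvalues and a set of orthonormal eigenfunctions of
$A_\alpha$, respectively, so that

$$ (A_\alpha \psi_\alpha^{(i)})(x) = \lambda_\alpha^{(i)} \psi_\alpha^{(i)}(x), \quad \text{for } \mu \text{ a.e. } x \in X, $$

and

$$ \langle \psi_\alpha^{(i)}, \psi_\alpha^{(j)} \rangle_{L^2(X, \mu)} = \delta(i - j), \quad \text{for all } i, j \geq 1. $$

Furthermore, as noted in [2], the eigenvalues of $P_\alpha$ are bounded in absolute value by one, with
at least one eigenvalue equaling one. Since the eigenvalues of $A_\alpha$ and $P_\alpha$ are the same, we also
have

$$ 1 = \lambda_\alpha^{(1)} \geq \lambda_\alpha^{(2)} \geq \lambda_\alpha^{(3)} \geq \cdots, $$

where $\lambda_\alpha^{(i)} \to 0$ as $i \to \infty$.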

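As with the original diffusion distance defined on a single data set, our generalized notion
of the diffusion distance for dynamic data sets has a simplified form in terms of the spectral
decompositions of the relevant operators.

Theorem 3.2. Let $(X, \mu)$ be a measure space and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ a family of kernels defined on $X$. If
$(X, \mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the properties of Assumption 1, then the diffusion distance at time $t$
between $x_\alpha$ and $y_\beta$ can be written as:

$$ D^{(t)}(x_\alpha, y_\beta)^2 = \sum_{i \geq 1} \left( \lambda_\alpha^{(i)} \right)^{2t} \psi_\alpha^{(i)}(x)^2 + \sum_{j \geq 1} \left( \lambda_\beta^{(j)} \right)^{2t} \psi_\beta^{(j)}(y)^2 - 2 \sum_{i, j \geq 1} \left( \lambda_\alpha^{(i)} \right)^t \left( \lambda_\beta^{(j)} \right)^t \psi_\alpha^{(i)}(x) \, \psi_\beta^{(j)}(y) \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)}, \tag{8} $$

where for each pair $(\alpha, \beta) \in \mathcal{I} \times \mathcal{I}$, equation (8) converges in $L^1(X \times X, \mu \otimes \mu)$. If additionally, $k_\alpha$
is continuous for each $\alpha \in \mathcal{I}$, $X \subseteq \mathbb{R}^d$ is closed, and $\mu$ is a strictly positive Borel measure, then
(8) holds for all $(x, y) \in X \times X$.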



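We postpone the proof of Theorem 3.2 until the end of this section. Notice that equation (8) is
in fact an extension of the formula given for the diffusion distance on a single data set. Indeed,
if one were to take $x_\alpha$ and $y_\beta = y_\alpha$, the formula given in (8) would simplify to (5) with the
underlying kernel taken to be $k_\alpha$. Thus, it is natural to define the diffusion map $\Psi_\alpha^{(t)} : X \to \ell^2$ for
the parameter $\alpha$ and diffusion time $t$ as

$$ \Psi_\alpha^{(t)}(x) \triangleq \left( (\lambda_\alpha^{(i)})^t \, \psi_\alpha^{(i)}(x) \right)_{i \geq 1}. \tag{9} $$

For $v \in \ell^2$, let $v[i]$ denote the $i$th element of the sequence $v$. Using (9), one can write equation (8)
as

$$ D^{(t)}(x_\alpha, y_\beta)^2 = \left\| \Psi_\alpha^{(t)}(x) \right\|^2_{\ell^2} + \left\| \Psi_\beta^{(t)}(y) \right\|^2_{\ell^2} - 2 \sum_{i, j \geq 1} \Psi_\alpha^{(t)}(x)[i] \, \Psi_\beta^{(t)}(y)[j] \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)}. \tag{10} $$

In particular, one has in general that

$$ D^{(t)}(x_\alpha, y_\beta) \neq \left\| \Psi_\alpha^{(t)}(x) - \Psi_\beta^{(t)}(y) \right\|_{\ell^2}. $$

Intuitively, the thing to take away from this discussion is that for each parameter $\alpha \in \mathcal{I}$, the
diffusion map $\Psi_\alpha^{(t)}$ maps $X$ into an $\ell^2$ space that itself also depends on $\alpha$. The $\ell^2$ embedding
corresponding to $\alpha$ is not the same as the $\ell^2$ embedding corresponding to $\beta \in \mathcal{I}$, but equation
(10) gives a way of computing distances between the different $\ell^2$ embeddings.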
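
Formula (10) can be tested numerically on a small example. The sketch below is an illustration only, under assumptions that go beyond the text: a finite data set with counting measure, Gaussian kernels, and two parameter values modeled by perturbing the point coordinates; the helper names `spectral_data` and `gaussian_kernel` are hypothetical. It compares the definition of $D^{(t)}(x_\alpha, y_\beta)^2$ with the right-hand side of (10), and also evaluates the naive $\ell^2$ distance between the two embeddings.

```python
import numpy as np

def gaussian_kernel(X, eps):
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-sq / eps)

def spectral_data(K, t):
    """Return A, its orthonormal eigenvectors psi, and the diffusion map Psi."""
    m = K.sum(axis=1)
    A = K / np.sqrt(np.outer(m, m))
    lam, psi = np.linalg.eigh(A)
    Psi = psi * lam ** t                  # row x holds Psi^(t)(x)
    return A, psi, Psi

rng = np.random.default_rng(1)
X_alpha = rng.uniform(size=(30, 2))                       # data at parameter alpha
X_beta = X_alpha + 0.1 * rng.standard_normal((30, 2))     # same points at parameter beta

t, x, y = 2, 3, 11
A_a, psi_a, Psi_a = spectral_data(gaussian_kernel(X_alpha, 0.5), t)
A_b, psi_b, Psi_b = spectral_data(gaussian_kernel(X_beta, 0.5), t)

# Definition: L^2 distance between a_alpha^(t)(x, .) and a_beta^(t)(y, .).
row_a = np.linalg.matrix_power(A_a, t)[x]
row_b = np.linalg.matrix_power(A_b, t)[y]
d2_definition = np.sum((row_a - row_b) ** 2)

# Formula (10): G[i, j] = <psi_alpha^(i), psi_beta^(j)>.
G = psi_a.T @ psi_b
d2_formula = (Psi_a[x] @ Psi_a[x] + Psi_b[y] @ Psi_b[y]
              - 2 * Psi_a[x] @ G @ Psi_b[y])

# Naive l^2 distance between the two embeddings; in general not equal to D^2.
d2_naive = np.sum((Psi_a[x] - Psi_b[y]) ** 2)

print(np.isclose(d2_definition, d2_formula, atol=1e-12), d2_naive)
```

The Gram matrix `G` of inner products between the two eigenbases is what couples the two embeddings; when $\alpha = \beta$ it reduces to the identity and (10) reduces to (5).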

Also, once again paralleling the original diffusion distance, we see that if the eigenvalues
of $A_\alpha$ and $A_\beta$ decay sufficiently fast, then the diffusion distance can be well approximated by a
small, finite number of eigenvalues and eigenfunctions of these two operators. In particular, we
need only map $\Gamma_\alpha$ and $\Gamma_\beta$ into finite dimensional Euclidean spaces.

Remark 3.3. One interesting aspect of the diffusion distance is its asymptotic behavior as $t \to \infty$,
and in particular that behavior when the family of graphs $\{\Gamma_\alpha\}_{\alpha \in \mathcal{I}}$ are connected graphs. In
this case, each operator $A_\alpha$ has precisely one eigenvalue equal to one, and the corresponding
eigenfunction is the square root of the density, i.e.,

$$ 1 = \lambda_\alpha^{(1)} > \lambda_\alpha^{(2)} \geq \lambda_\alpha^{(3)} \geq \cdots, \quad \text{and} \quad \psi_\alpha^{(1)}(x) = \sqrt{m_\alpha(x)}. $$

It is quite simple to show then that

$$ \lim_{t \to \infty} D^{(t)}(x_\alpha, y_\beta)^2 = \left( \sqrt{m_\alpha(x)} - \sqrt{m_\beta(y)} \right)^2 + \sqrt{m_\alpha(x)} \sqrt{m_\beta(y)} \, \left\| \sqrt{m_\alpha(\cdot)} - \sqrt{m_\beta(\cdot)} \right\|^2_{L^2(X, \mu)}. $$

Thus the asymptotic diffusion distance can be computed without diagonalizing any of the
diffusion operators. Furthermore, we see that it is not just the pointwise difference between the
densities, but rather it is the pointwise difference plus a term that takes into account the global
difference between the two densities. It can be used as a fast way of determining significant changes
from $\alpha$ to $\beta$.
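
The asymptotic formula of Remark 3.3 can be compared against a large but finite diffusion time. The sketch below is illustrative and relies on assumptions beyond the text: a finite data set with counting measure, Gaussian kernels at two hypothetical scales standing in for the parameters $\alpha$ and $\beta$, and, importantly, kernels rescaled so that each density has total mass one, which makes $\sqrt{m_\alpha}$ a unit-norm eigenfunction. The helper name `normalized_symmetric_operator` is hypothetical.

```python
import numpy as np

def normalized_symmetric_operator(K):
    """Return (A, m) with K rescaled so that the density m sums to one."""
    K = K / K.sum()                       # assumption: total mass of m equals one
    m = K.sum(axis=1)
    A = K / np.sqrt(np.outer(m, m))       # A itself is unaffected by the rescaling
    return A, m

rng = np.random.default_rng(2)
X = rng.uniform(size=(30, 2))             # points in the unit square: connected graphs
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
A_a, m_a = normalized_symmetric_operator(np.exp(-sq / 1.0))   # parameter alpha
A_b, m_b = normalized_symmetric_operator(np.exp(-sq / 2.0))   # parameter beta

# Diffusion distance at a large diffusion time, computed from the definition.
x, y, t = 4, 9, 300
row_a = np.linalg.matrix_power(A_a, t)[x]
row_b = np.linalg.matrix_power(A_b, t)[y]
d2_large_t = np.sum((row_a - row_b) ** 2)

# Asymptotic formula: only the densities are needed, no diagonalization.
ra, rb = np.sqrt(m_a), np.sqrt(m_b)
d2_limit = (ra[x] - rb[y]) ** 2 + ra[x] * rb[y] * np.sum((ra - rb) ** 2)

print(np.isclose(d2_large_t, d2_limit, rtol=1e-6, atol=1e-12))
```

The right-hand side costs only row sums of the two kernels, which is what makes it attractive as a quick screen for significant changes between parameters.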



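Proof of Theorem 3.2. We first use the fact that for each $\alpha \in \mathcal{I}$, $A_\alpha$ is a positive, trace class
operator. Thus $A_\alpha$ is Hilbert-Schmidt, and so we know that for each $\alpha \in \mathcal{I}$,

$$ a_\alpha^{(t)}(x, y) = \sum_{i \geq 1} \left( \lambda_\alpha^{(i)} \right)^t \psi_\alpha^{(i)}(x) \, \psi_\alpha^{(i)}(y), \quad \text{with convergence in } L^2(X \times X, \mu \otimes \mu). \tag{11} $$

If the additional assumptions hold that $k_\alpha$ is continuous, $X$ is a closed subset of $\mathbb{R}^d$, and $\mu$ is a
strictly positive Borel measure, then by Mercer's Theorem (see [11, 12]) equation (11) will hold
for all $(x, y) \in X \times X$. In this case the proof can be easily amended to get the stronger result; we
omit the details.

Expand the formula for $D^{(t)}(x_\alpha, y_\beta)$ as follows:

$$ D^{(t)}(x_\alpha, y_\beta)^2 = \int_X \left( a_\alpha^{(t)}(x, u)^2 - 2 \, a_\alpha^{(t)}(x, u) \, a_\beta^{(t)}(y, u) + a_\beta^{(t)}(y, u)^2 \right) d\mu(u). \tag{12} $$

We shall evaluate each of the three terms in (12) separately. For the cross term we have,

$$ \int_X a_\alpha^{(t)}(x, u) \, a_\beta^{(t)}(y, u) \, d\mu(u) = \int_X \sum_{i, j \geq 1} \left( \lambda_\alpha^{(i)} \right)^t \left( \lambda_\beta^{(j)} \right)^t \psi_\alpha^{(i)}(x) \, \psi_\beta^{(j)}(y) \, \psi_\alpha^{(i)}(u) \, \psi_\beta^{(j)}(u) \, d\mu(u), \tag{13} $$

with convergence in $L^1(X \times X, \mu \otimes \mu)$. At this point we would like to switch the integral and the
summation in line (13); this can be done by applying Theorem 2.25, page 55, from [13], which
requires one to show the following:

$$ \sum_{i, j \geq 1} \int_X \left| \left( \lambda_\alpha^{(i)} \right)^t \left( \lambda_\beta^{(j)} \right)^t \psi_\alpha^{(i)}(x) \, \psi_\beta^{(j)}(y) \, \psi_\alpha^{(i)}(u) \, \psi_\beta^{(j)}(u) \right| d\mu(u) < \infty. \tag{14} $$

One can prove (14) for $\mu \otimes \mu$ almost every $(x, y) \in X \times X$ through the use of Hölder's inequality
and the fact that we assumed that $A_\alpha$ is a trace class operator for each $\alpha \in \mathcal{I}$; we leave the details
to the reader. Thus for $\mu \otimes \mu$ almost every $(x, y) \in X \times X$ we can switch the integral and the
summation in line (13), which gives:

$$ \int_X a_\alpha^{(t)}(x, u) \, a_\beta^{(t)}(y, u) \, d\mu(u) = \sum_{i, j \geq 1} \left( \lambda_\alpha^{(i)} \right)^t \left( \lambda_\beta^{(j)} \right)^t \psi_\alpha^{(i)}(x) \, \psi_\beta^{(j)}(y) \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)}, \tag{15} $$

again with convergence in $L^2(X \times X, \mu \otimes \mu)$. A similar calculation shows that, for each $\alpha \in \mathcal{I}$,

$$ \int_X a_\alpha^{(t)}(x, u)^2 \, d\mu(u) = \sum_{i \geq 1} \left( \lambda_\alpha^{(i)} \right)^{2t} \psi_\alpha^{(i)}(x)^2, \quad \text{with convergence in } L^2(X, \mu). \tag{16} $$

Combining equations (15) and (16) we arrive at the desired formula for $D^{(t)}(x_\alpha, y_\beta)$. □

3.3. Mapping one diffusion embedding into another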

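As mentioned in the previous subsection, the diffusion map $\Psi_\alpha^{(t)}$ takes $X$ into an $\ell^2$ space
that itself depends on $\alpha$. While (10) gives a way of computing distances between two diffusion
embeddings, it is also possible to map the embedding $\Psi_\beta^{(t)}(X)$ into the $\ell^2$ space of $\Psi_\alpha^{(t)}(X)$.
Furthermore, the operator that does so is quite simple. The eigenfunctions $\{\psi_\alpha^{(i)}\}_{i \geq 1}$ are essentially a
basis for the embedding of $X$ with parameter $\alpha$, while the eigenfunctions $\{\psi_\beta^{(j)}\}_{j \geq 1}$ are essentially
a basis for the embedding of $X$ with parameter $\beta$. The operator that maps one space into the other
is similar to the change of basis operator. Define $O_{\beta \to \alpha} : \ell^2 \to \ell^2$ as

$$ O_{\beta \to \alpha} v \triangleq \left( \sum_{j \geq 1} v[j] \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)} \right)_{i \geq 1}, \quad \text{for all } v \in \ell^2. $$

By the Spectral Theorem, we know that the eigenfunctions of $A_\alpha$ can be taken to form an
orthonormal basis for $L^2(X, \mu)$. Thus, the operator $O_{\beta \to \alpha}$ preserves inner products. Indeed, define
the operator $S_\alpha : L^2(X, \mu) \to \ell^2$ as

$$ S_\alpha f \triangleq \left( \langle \psi_\alpha^{(i)}, f \rangle_{L^2(X, \mu)} \right)_{i \geq 1}, \quad \text{for all } f \in L^2(X, \mu). $$

The adjoint of $S_\alpha$, $S_\alpha^* : \ell^2 \to L^2(X, \mu)$, is then given by

$$ S_\alpha^* v = \sum_{i \geq 1} v[i] \, \psi_\alpha^{(i)}, \quad \text{for all } v \in \ell^2. $$

Since $\{\psi_\alpha^{(i)}\}_{i \geq 1}$ is an orthonormal basis for $L^2(X, \mu)$, $S_\alpha^* S_\alpha = I_{L^2(X, \mu)}$. Therefore, for any $v, w \in \ell^2$,

$$
\begin{aligned}
\langle O_{\beta \to \alpha} v, O_{\beta \to \alpha} w \rangle_{\ell^2}
&= \sum_{j, k \geq 1} v[j] \, w[k] \left( \sum_{i \geq 1} \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)} \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(k)} \rangle_{L^2(X, \mu)} \right) \\
&= \sum_{j, k \geq 1} v[j] \, w[k] \, \langle S_\alpha^* S_\alpha \psi_\beta^{(j)}, \psi_\beta^{(k)} \rangle_{L^2(X, \mu)} \\
&= \sum_{j, k \geq 1} v[j] \, w[k] \, \delta(j - k) \\
&= \langle v, w \rangle_{\ell^2}.
\end{aligned}
\tag{17}
$$

As asserted, the operator $O_{\beta \to \alpha}$ preserves inner products. In particular, it preserves norms, so we
have

$$ \left\| \Psi_\alpha^{(t)}(x) - O_{\beta \to \alpha} \Psi_\beta^{(t)}(y) \right\|^2_{\ell^2} = \left\| \Psi_\alpha^{(t)}(x) \right\|^2_{\ell^2} + \left\| \Psi_\beta^{(t)}(y) \right\|^2_{\ell^2} - 2 \sum_{i, j \geq 1} \Psi_\alpha^{(t)}(x)[i] \, \Psi_\beta^{(t)}(y)[j] \, \langle \psi_\alpha^{(i)}, \psi_\beta^{(j)} \rangle_{L^2(X, \mu)} = D^{(t)}(x_\alpha, y_\beta)^2. $$

Thus the operator $O_{\beta \to \alpha}$ maps the diffusion embedding $\Psi_\beta^{(t)}(X)$ into the same $\ell^2$ space as the
diffusion embedding $\Psi_\alpha^{(t)}(X)$, and furthermore preserves the diffusion distance between the two
spaces; it is easy to see that it also preserves the diffusion distance within $\Gamma_\beta$. In particular, it
is possible to view both embeddings in the same $\ell^2$ space, where the $\ell^2$ distance is equal to the
diffusion distance both within each graph $\Gamma_\alpha$ and $\Gamma_\beta$ and between the two graphs.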

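Suppose now that we have three or more parameters in $I$ that are of interest. Can we map all diffusion embeddings of these parameters into the same $\ell^2$ space, while preserving the diffusion distances? The answer turns out to be "yes," and in fact we can use the same mapping as before. Let $\gamma \in I$ be the base parameter to which all other parameters are mapped, and let $\alpha, \beta \in I$ be two other arbitrary parameters. We know that we can map the embedding $\Psi_\alpha^{(t)}(X)$ into the $\ell^2$ space of $\Psi_\gamma^{(t)}(X)$, and that we can also map the embedding $\Psi_\beta^{(t)}(X)$ into the $\ell^2$ space of $\Psi_\gamma^{(t)}(X)$, and that these mappings will preserve diffusion distances within each of $\Gamma_\gamma$, $\Gamma_\alpha$, and $\Gamma_\beta$, and also between $\Gamma_\gamma$ and $\Gamma_\alpha$ as well as between $\Gamma_\gamma$ and $\Gamma_\beta$. We just need to show that they preserve the diffusion distance between points of $\Gamma_\alpha$ and points of $\Gamma_\beta$. Using essentially the same calculation as the one used to derive the analogous identity above, one can obtain the following for any $v, w \in \ell^2$:

$$\langle O_{\alpha\to\gamma}v,\, O_{\beta\to\gamma}w\rangle_{\ell^2} = \sum_{i,j\geq 1} v[i]\,w[j]\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}.$$

But then we have:

$$\begin{aligned}
\|O_{\alpha\to\gamma}\Psi_\alpha^{(t)}(x) - O_{\beta\to\gamma}\Psi_\beta^{(t)}(y)\|_{\ell^2}^2 &= \|O_{\alpha\to\gamma}\Psi_\alpha^{(t)}(x)\|_{\ell^2}^2 + \|O_{\beta\to\gamma}\Psi_\beta^{(t)}(y)\|_{\ell^2}^2 - 2\langle O_{\alpha\to\gamma}\Psi_\alpha^{(t)}(x),\, O_{\beta\to\gamma}\Psi_\beta^{(t)}(y)\rangle_{\ell^2} \\
&= \|\Psi_\alpha^{(t)}(x)\|_{\ell^2}^2 + \|\Psi_\beta^{(t)}(y)\|_{\ell^2}^2 - 2\sum_{i,j\geq 1}(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\psi_\alpha^{(i)}(x)\,\psi_\beta^{(j)}(y)\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)} \\
&= D^{(t)}(x_\alpha,y_\beta)^2.
\end{aligned}$$

Thus, after mapping the $\alpha$ and $\beta$ embeddings appropriately into the $\gamma$ embedding, the $\ell^2$ distance is equal to all possible diffusion distances. It is therefore possible to map each of the embeddings $\{\Psi_\alpha^{(t)}(X)\}_{\alpha\in I}$ into the same $\ell^2$ space. In particular, one can track the evolution of the intrinsic geometry of $X$ as it changes over $I$. We summarize this discussion in the following theorem.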

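Theorem 3.4. Let $(X,\mu)$ be a measure space and $\{k_\alpha\}_{\alpha\in I}$ a family of kernels defined on $X$. Fix a parameter $\gamma \in I$. If $(X,\mu)$ and $\{k_\alpha\}_{\alpha\in I}$ satisfy the properties of Assumption 1, then for all $(\alpha,\beta) \in I \times I$,

$$D^{(t)}(x_\alpha,y_\beta) = \|O_{\alpha\to\gamma}\Psi_\alpha^{(t)}(x) - O_{\beta\to\gamma}\Psi_\beta^{(t)}(y)\|_{\ell^2}, \quad \text{with convergence in } L^2(X\times X,\mu\otimes\mu).$$
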
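For a finite point set with counting measure, the statement of Theorem 3.4 can be checked numerically. The following is a minimal sketch under assumptions that are not in the text: the three "parameters" are hypothetical deformations of one random point cloud, the kernels are Gaussian, and the helper names are illustrative only.

```python
import numpy as np

def symmetric_diffusion_matrix(points, eps):
    """Gaussian kernel k, normalized to a = k / (sqrt(m) sqrt(m)^T)."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / eps)
    m = k.sum(axis=1)
    return k / np.sqrt(np.outer(m, m))

def diffusion_map(a, t):
    """Return (embedding, psi); row x of embedding is (lambda_i^t psi_i(x))_i."""
    lam, psi = np.linalg.eigh(a)
    return psi * lam**t, psi

def rotate_into_base(embedding, psi, psi_base):
    """Apply O_{alpha->gamma}, whose (i, j) entry is <psi_gamma^(i), psi_alpha^(j)>."""
    rotation = psi_base.T @ psi
    return embedding @ rotation.T

# Toy data: one point set observed under three hypothetical parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
a_alpha = symmetric_diffusion_matrix(x, eps=1.0)
a_beta = symmetric_diffusion_matrix(x * np.array([1.0, 0.5, 2.0]), eps=1.5)
a_gamma = symmetric_diffusion_matrix(x + 0.1 * rng.normal(size=x.shape), eps=1.0)

t = 3
emb_alpha, psi_alpha = diffusion_map(a_alpha, t)
emb_beta, psi_beta = diffusion_map(a_beta, t)
_, psi_gamma = diffusion_map(a_gamma, t)
common_alpha = rotate_into_base(emb_alpha, psi_alpha, psi_gamma)
common_beta = rotate_into_base(emb_beta, psi_beta, psi_gamma)

# Diffusion distance computed directly from rows of the t-step kernels,
# and again as an l^2 distance in the common (gamma) embedding.
direct = np.linalg.norm(
    np.linalg.matrix_power(a_alpha, t)[3] - np.linalg.matrix_power(a_beta, t)[7]
)
embedded = np.linalg.norm(common_alpha[3] - common_beta[7])
```

In this finite setting the rotation is an orthogonal matrix, so `direct` and `embedded` agree up to floating point error; the choice of base parameter only changes the coordinate system.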
3.4. Historical graph 

The diffusion distance $D^{(t)}(x_\alpha, y_\beta)$ defines a measure of similarity between $x_\alpha$ and $y_\beta$ by comparing the local neighborhoods of each point in their respective graphs. The comparison is, by definition, indirect. It is possible though to use the diffusion distance to create a historical graph in which every point throughout $X \times I$ is compared directly.

Suppose, for example, that $I \subset \mathbb{R}$ and that $\rho$ is a measure for $I$. Assume that $\rho(I) < \infty$, $\mu(X) < \infty$, $0 < C_1 \leq k_\alpha(x,y) \leq C_2 < \infty$ for all $x,y \in X$, $\alpha \in I$, and that the function $(x,y,\alpha) \mapsto k_\alpha(x,y)$ is a measurable function from $(X \times X \times I, \mu \otimes \mu \otimes \rho)$ to $\mathbb{R}$. Then for each $t \in \mathbb{N}$, one can define a kernel $k_t : (X \times I) \times (X \times I) \to \mathbb{R}$ as

$$k_t(x_\alpha, y_\beta) = e^{-D^{(t)}(x_\alpha,y_\beta)/\varepsilon}, \quad \text{for all } (x_\alpha, y_\beta) \in (X \times I) \times (X \times I),$$

where $\varepsilon > 0$ is a fixed scaling parameter. The kernel $k_t$ is a direct measure of similarity across $X$ and the parameter space $I$. Thus, when $I$ is time, we think of $(X \times I, k_t)$ as defining a historical graph in which all points throughout history are related to one another. By our assumptions, it is not hard to see that $0 < C_1(t) \leq k_t(x_\alpha,y_\beta) \leq C_2(t) < \infty$ for all $x_\alpha, y_\beta \in X \times I$. Therefore we can define the density $m_t : X \times I \to \mathbb{R}$,

$$m_t(x_\alpha) = \int_I \int_X k_t(x_\alpha, y_\beta)\, d\mu(y)\, d\rho(\beta), \quad \text{for all } x_\alpha \in X \times I,$$

as well as the normalized kernel $a_t : (X \times I) \times (X \times I) \to \mathbb{R}$,

$$a_t(x_\alpha, y_\beta) = \frac{k_t(x_\alpha,y_\beta)}{\sqrt{m_t(x_\alpha)}\,\sqrt{m_t(y_\beta)}}, \quad \text{for all } (x_\alpha, y_\beta) \in (X \times I) \times (X \times I).$$

Once again using the given assumptions, one can conclude that $a_t \in L^2(X \times I \times X \times I, \mu \otimes \rho \otimes \mu \otimes \rho)$. Thus it defines a Hilbert-Schmidt integral operator $A_t : L^2(X \times I, \mu \otimes \rho) \to L^2(X \times I, \mu \otimes \rho)$,

$$(A_t f)(x_\alpha) = \int_I \int_X a_t(x_\alpha,y_\beta)\, f(y_\beta)\, d\mu(y)\, d\rho(\beta), \quad \text{for all } f \in L^2(X \times I, \mu \otimes \rho).$$

Let $\{\psi_t^{(i)}\}_{i\geq 1}$ and $\{\lambda_t^{(i)}\}_{i\geq 1}$ denote the eigenfunctions and eigenvalues of $A_t$, respectively. The corresponding diffusion map $\Psi_t^{(s)} : X \times I \to \ell^2$ is given by:

$$\Psi_t^{(s)}(x_\alpha) \triangleq \left((\lambda_t^{(i)})^s\, \psi_t^{(i)}(x_\alpha)\right)_{i\geq 1}, \quad \text{for all } x_\alpha \in X \times I.$$

In the case when $I$ is time, this diffusion map embeds the entire history of $X$ across all of $I$ into a single low dimensional space. Unlike the common embedding defined by Theorem 3.4, each point $x_\alpha$ is embedded in relation to the entire history of $X$, not just its relationship to other points $y_\alpha$ from the same time. As such, for each $x \in X$, one can view the trajectory of $x$ through time as it relates to all of history, i.e., one can view:

$$T_x : I \to \ell^2, \qquad T_x(\alpha) \triangleq \Psi_t^{(s)}(x_\alpha).$$

In turn, the trajectories $\{T_x\}_{x\in X}$ can be used to define a measure of similarity between the data points in $X$ that takes into account the history of each point.
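The historical graph construction can be sketched for finitely many points and finitely many parameters. This is an illustration only: it assumes counting measures on both $X$ and $I$, takes the symmetric diffusion matrices as given, and uses hypothetical function names; the scaling $\varepsilon$ is a free choice.

```python
import numpy as np

def symmetric_normalize(k):
    """Return k / (sqrt(m) sqrt(m)^T), where m holds the row sums of k."""
    m = k.sum(axis=1)
    return k / np.sqrt(np.outer(m, m))

def historical_diffusion_map(a_list, t, eps, s, n_coords):
    """a_list: symmetric diffusion matrices a_alpha on one shared point set.

    Returns the embedding of all (x, alpha) pairs and the eigenvalues used.
    Rows are ordered parameter by parameter.
    """
    # Row x of a_alpha^t is the t-step diffusion centred at x_alpha.
    rows = np.vstack([np.linalg.matrix_power(a, t) for a in a_list])
    # Pairwise diffusion distances D^(t)(x_alpha, y_beta) over all of X x I.
    sq_norms = np.sum(rows**2, axis=1)
    d_sq = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * rows @ rows.T, 0.0)
    k_t = np.exp(-np.sqrt(d_sq) / eps)      # historical kernel
    a_t = symmetric_normalize(k_t)          # normalized historical kernel
    lam, psi = np.linalg.eigh(a_t)
    order = np.argsort(lam)[::-1][:n_coords]
    return psi[:, order] * lam[order] ** s, lam[order]
```

Reshaping the returned embedding to `(len(a_list), n, n_coords)` gives, for each point `x`, its trajectory `T_x` across the parameters.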

Remark 3.5. It is also possible to define $k_t$ in terms of the inner products of the symmetric diffusion kernels, i.e.,

$$k_t(x_\alpha,y_\beta) = \int_X a_\alpha^{(t)}(x, u)\, a_\beta^{(t)}(y, u)\, d\mu(u).$$



Remark 3.6. The diffusion distance and corresponding analysis contained in Section 3 can be extended to the more general case in which one has a sequence of data sets $\{X_\alpha\}_{\alpha\in I}$ for which there does not exist a bijective correspondence between each pair. If there is a sufficiently large set $S$ such that $S \subset X_\alpha$ for each $\alpha \in I$, then one can compute a diffusion distance from any $x_\alpha \in X_\alpha$ to any $y_\beta \in X_\beta$ through the common set $S$. See Appendix A for more details.

4. Global diffusion distance



Now that we have developed a diffusion distance between pairs of data points from $(X \times I) \times (X \times I)$, it is possible to define a global diffusion distance between $\Gamma_\alpha$ and $\Gamma_\beta$. The aim here is to define a diffusion distance that gives a global measure of the change in $X$ from $\alpha$ to $\beta$. In turn, when applied over the whole parameter space, one can organize the global behavior of the data as it changes over $I$. For each diffusion time $t \in \mathbb{N}$, let $\mathcal{D}^{(t)} : \{\Gamma_\alpha\}_{\alpha\in I} \times \{\Gamma_\alpha\}_{\alpha\in I} \to \mathbb{R}$ be this global diffusion distance, where

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 \triangleq \|A_\alpha^t - A_\beta^t\|_{HS}^2 = \|a_\alpha^{(t)} - a_\beta^{(t)}\|_{L^2(X\times X,\mu\otimes\mu)}^2 = \int_X\int_X \left(a_\alpha^{(t)}(x,y) - a_\beta^{(t)}(x,y)\right)^2 d\mu(x)\,d\mu(y).$$

In fact, since $\mu$ is a $\sigma$-finite measure, the global diffusion distance can be written in terms of the pointwise diffusion distance by applying Tonelli's Theorem:

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \int_X D^{(t)}(x_\alpha,x_\beta)^2\, d\mu(x).$$

Thus the global diffusion distance measures the similarity between $\Gamma_\alpha$ and $\Gamma_\beta$ by comparing the behavior of each of the corresponding diffusions on each of the graphs. Therefore, the global diffusion distance will be small if $\Gamma_\alpha$ and $\Gamma_\beta$ have similar geometry, and large if their geometry is significantly different.
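In the finite case with counting measure, the two expressions for the global diffusion distance reduce to a Frobenius norm and a sum of squared pointwise distances. The sketch below illustrates that identity; the function names are hypothetical and the inputs are assumed to be symmetric diffusion matrices on the same vertex set.

```python
import numpy as np

def global_distance_sq(a_alpha, a_beta, t):
    """Squared Hilbert-Schmidt (Frobenius) norm of a_alpha^t - a_beta^t."""
    diff = np.linalg.matrix_power(a_alpha, t) - np.linalg.matrix_power(a_beta, t)
    return np.sum(diff**2)

def pointwise_distance_sq(a_alpha, a_beta, t, i):
    """Squared diffusion distance between vertex i in the two graphs."""
    row_alpha = np.linalg.matrix_power(a_alpha, t)[i]
    row_beta = np.linalg.matrix_power(a_beta, t)[i]
    return np.sum((row_alpha - row_beta) ** 2)

def global_distance_sq_from_pointwise(a_alpha, a_beta, t):
    """Discrete analogue of integrating D^(t)(x_alpha, x_beta)^2 over x."""
    n = a_alpha.shape[0]
    return sum(pointwise_distance_sq(a_alpha, a_beta, t, i) for i in range(n))
```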

As with the pointwise diffusion distance $D^{(t)}$, the global diffusion distance can be written in a simplified form in terms of the spectral decompositions of the operators $A_\alpha$ and $A_\beta$.

Theorem 4.1. Let $(X,\mu)$ be a measure space and $\{k_\alpha\}_{\alpha\in I}$ a family of kernels defined on $X$. If $(X,\mu)$ and $\{k_\alpha\}_{\alpha\in I}$ satisfy the properties of Assumption 1, then the global diffusion distance at time $t$ between $\Gamma_\alpha$ and $\Gamma_\beta$ can be written as:

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \sum_{i,j\geq 1}\left((\lambda_\alpha^{(i)})^t - (\lambda_\beta^{(j)})^t\right)^2 \langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}^2. \tag{18}$$

We postpone the proof till the end of this section. Equation (18) gives a new way to interpret the global diffusion distance. The orthonormal basis $\{\psi_\alpha^{(i)}\}_{i\geq 1}$ is a set of diffusion coordinates for $\Gamma_\alpha$, while the orthonormal basis $\{\psi_\beta^{(j)}\}_{j\geq 1}$ is a set of diffusion coordinates for $\Gamma_\beta$. Interpreting the summands of (18) in this context, we see that the global diffusion distance measures the similarity of $\Gamma_\alpha$ and $\Gamma_\beta$ by taking a weighted rotation of one coordinate system into the other.
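For symmetric matrices, the spectral formula of Theorem 4.1 can be evaluated directly from two eigendecompositions. The following sketch assumes counting measure and full eigendecompositions; truncating to the leading eigenpairs, as done later in the applications, would only approximate the value. The function name is illustrative.

```python
import numpy as np

def global_distance_sq_spectral(a_alpha, a_beta, t):
    """Sum over i, j of (lam_a_i^t - lam_b_j^t)^2 * <psi_a_i, psi_b_j>^2."""
    lam_a, psi_a = np.linalg.eigh(a_alpha)
    lam_b, psi_b = np.linalg.eigh(a_beta)
    overlap_sq = (psi_a.T @ psi_b) ** 2                       # squared inner products
    gap_sq = (lam_a[:, None] ** t - lam_b[None, :] ** t) ** 2  # eigenvalue differences
    return np.sum(gap_sq * overlap_sq)
```

Comparing this value with the squared Frobenius norm of `a_alpha^t - a_beta^t` gives a quick consistency check on an implementation.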

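Remark 4.2. As with the pointwise diffusion distance, the asymptotic behavior of the global diffusion distance when $\{\Gamma_\alpha\}_{\alpha\in I}$ is a family of connected graphs is both interesting and easy to characterize. Under the same connectivity assumptions as in the corresponding remark for the pointwise diffusion distance, it is not hard to show that

$$\lim_{t\to\infty}\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = 2\left(1 - \langle\psi_\alpha^{(1)},\psi_\beta^{(1)}\rangle_{L^2(X,\mu)}^2\right).$$

Proof of Theorem 4.1. Since

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \int_X D^{(t)}(x_\alpha, x_\beta)^2\, d\mu(x),$$

we can build upon Theorem 3.2. In particular, we have

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \int_X \Bigg(\sum_{i\geq 1}(\lambda_\alpha^{(i)})^{2t}\psi_\alpha^{(i)}(x)^2 + \sum_{j\geq 1}(\lambda_\beta^{(j)})^{2t}\psi_\beta^{(j)}(x)^2 - 2\sum_{i,j\geq 1}(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\psi_\alpha^{(i)}(x)\,\psi_\beta^{(j)}(x)\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}\Bigg)\, d\mu(x).$$

As in the proof of Theorem 3.2, we have three terms that we shall evaluate separately. Focusing on the cross terms as before, we would like to switch the integral and the summation; this time we need to show

$$\sum_{i,j\geq 1}\int_X \left|(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\psi_\alpha^{(i)}(x)\,\psi_\beta^{(j)}(x)\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}\right| d\mu(x) < \infty. \tag{19}$$

One can show (19) by using Hölder's inequality, the Cauchy-Schwarz inequality, and the assumption that $A_\alpha$ is a trace class operator for each $\alpha \in I$. Therefore we can switch the integral and the summation, which gives:

$$\int_X \sum_{i,j\geq 1}(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\psi_\alpha^{(i)}(x)\,\psi_\beta^{(j)}(x)\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}\, d\mu(x) = \sum_{i,j\geq 1}(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}^2. \tag{20}$$

A similar calculation also shows that for each $\alpha \in I$,

$$\int_X \sum_{i\geq 1}(\lambda_\alpha^{(i)})^{2t}\,\psi_\alpha^{(i)}(x)^2\, d\mu(x) = \sum_{i\geq 1}(\lambda_\alpha^{(i)})^{2t}. \tag{21}$$

Putting (20) and (21) together, we arrive at:

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \sum_{i\geq 1}(\lambda_\alpha^{(i)})^{2t} + \sum_{j\geq 1}(\lambda_\beta^{(j)})^{2t} - 2\sum_{i,j\geq 1}(\lambda_\alpha^{(i)})^t(\lambda_\beta^{(j)})^t\,\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}^2. \tag{22}$$

Furthermore, by the Spectral Theorem we can take $\{\psi_\alpha^{(i)}\}_{i\geq 1}$ and $\{\psi_\beta^{(j)}\}_{j\geq 1}$ to be orthonormal bases for $L^2(X,\mu)$. In particular,

$$\sum_{j\geq 1}\langle\psi_\alpha^{(i_0)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}^2 = \sum_{i\geq 1}\langle\psi_\alpha^{(i)},\psi_\beta^{(j_0)}\rangle_{L^2(X,\mu)}^2 = 1, \quad \text{for all } i_0, j_0 \geq 1.$$

Therefore we can simplify (22) to

$$\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)^2 = \sum_{i,j\geq 1}\left((\lambda_\alpha^{(i)})^t - (\lambda_\beta^{(j)})^t\right)^2\langle\psi_\alpha^{(i)},\psi_\beta^{(j)}\rangle_{L^2(X,\mu)}^2.$$

□

5. Random sampling theorems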



In applications, the given data is finite and often sampled from some continuous data set $X$. In this section we examine the behavior of the pointwise and global diffusion distances when applied to a randomly sampled, finite collection of samples taken from $X$.

5.1. Updated assumptions

In order to frame this discussion in the appropriate setting, we update our assumptions on the measure space $(X,\mu)$ and the kernels $\{k_\alpha\}_{\alpha\in I}$. The results from this section will rely heavily upon the work contained in [14, 15], and so we follow their lead. First, for any $l \in \mathbb{N}$, let $C_b^l(X)$ denote the set of continuous bounded functions on $X$ such that all derivatives of order $l$ exist and are themselves continuous, bounded functions.

Assumption 2. We assume the following properties:

1. The measure $\mu$ is a probability measure, so that $\mu(X) = 1$.

2. $X$ is a bounded open subset of $\mathbb{R}^d$ that satisfies the cone condition (see page 93 of [16]).

3. For each $\alpha \in I$, the kernel $k_\alpha$ is symmetric, positive definite, and bounded from above and below, so that

$$0 < C_1(\alpha) \leq k_\alpha(x,y) \leq C_2(\alpha) < \infty.$$

4. For each $\alpha \in I$, $k_\alpha \in C_b^{d+1}(X \times X)$.

Note that every property from Assumption 1 is either contained in or can be derived from the properties in Assumption 2. Therefore the results of the previous sections still apply under these new assumptions.

The first assumption that $\mu$ be a probability measure is needed since we will be randomly sampling points from $X$. The probability measure from which we sample is $\mu$. The second and fourth assumptions are necessary to apply certain Sobolev embedding theorems which are integral to constructing a reproducing kernel Hilbert space that contains the family of kernels $\{a_\alpha\}_{\alpha\in I}$ and their empirical equivalents. More details can be found in Appendix B.



5.2. Sampling and finite graphs 

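Consider the space $X$ and suppose that $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ are sampled i.i.d. according to $\mu$. We are going to discretize the framework we have developed to accommodate the samples $X_n$. Let $\Gamma_{\alpha,n} = (X_n, k_\alpha|_{X_n})$ be the finite graph with vertices $X_n$ and weighted edges given by $k_\alpha|_{X_n}$.

We now define the finite, matrix equivalents to the continuous operators from section 3.1. To start, first define for each $\alpha \in I$ the $n \times n$ matrices $\mathbb{K}_\alpha$ as:

$$\mathbb{K}_\alpha[i,j] = \frac{1}{n}\, k_\alpha(x^{(i)}, x^{(j)}), \quad \text{for all } i,j = 1,\ldots,n.$$

We also define the corresponding diagonal degree matrices $\mathbb{D}_\alpha$ as:

$$\mathbb{D}_\alpha[i,i] = \frac{1}{n}\sum_{j=1}^n k_\alpha(x^{(i)}, x^{(j)}) = \sum_{j=1}^n \mathbb{K}_\alpha[i,j], \quad \text{for all } i = 1,\ldots,n.$$

Finally, the discrete analog of the operator $A_\alpha$ is given by the matrix $\mathbb{A}_\alpha$, which is defined as

$$\mathbb{A}_\alpha = \mathbb{D}_\alpha^{-1/2}\,\mathbb{K}_\alpha\,\mathbb{D}_\alpha^{-1/2}, \quad \text{for all } \alpha \in I.$$

We can now define the pointwise and global diffusion distances for the finite graphs $\{\Gamma_{\alpha,n}\}_{\alpha\in I}$ in terms of the matrices $\{\mathbb{A}_\alpha\}_{\alpha\in I}$. Set $x_\alpha^{(i)} = (x^{(i)}, \alpha) \in X_n \times I$, and let $D_n^{(t)} : (X_n \times I) \times (X_n \times I) \to \mathbb{R}$ denote the empirical version of the pointwise diffusion distance. We define it as:

$$D_n^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)})^2 = n\sum_{k=1}^n \left(\mathbb{A}_\alpha^t[i,k] - \mathbb{A}_\beta^t[j,k]\right)^2.$$

Let $\mathcal{D}_n^{(t)} : \{\Gamma_{\alpha,n}\}_{\alpha\in I} \times \{\Gamma_{\alpha,n}\}_{\alpha\in I} \to \mathbb{R}$ denote the empirical global diffusion distance, where

$$\mathcal{D}_n^{(t)}(\Gamma_{\alpha,n},\Gamma_{\beta,n})^2 = \|\mathbb{A}_\alpha^t - \mathbb{A}_\beta^t\|_{HS}^2 = \sum_{i,j=1}^n \left(\mathbb{A}_\alpha^t[i,j] - \mathbb{A}_\beta^t[i,j]\right)^2.$$

We then have the following two theorems relating $D_n^{(t)}$ to $D^{(t)}$, and $\mathcal{D}_n^{(t)}$ to $\mathcal{D}^{(t)}$, respectively.

Theorem 5.1. Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha\in I}$ satisfy the conditions of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to $\mu$; also let $t \in \mathbb{N}$, $\tau > 0$, and $\alpha, \beta \in I$. Then, with probability $1 - 2e^{-\tau}$,

$$\left|D^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)}) - D_n^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)})\right| \leq C(\alpha,\beta,d,t)\,\frac{\sqrt{\tau}}{\sqrt{n}}, \quad \text{for all } i,j = 1,\ldots,n.$$

Theorem 5.2. Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha\in I}$ satisfy the conditions of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to $\mu$; also let $t \in \mathbb{N}$, $\tau > 0$, and $\alpha, \beta \in I$. Then, with probability $1 - 2e^{-\tau}$,

$$\left|\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta) - \mathcal{D}_n^{(t)}(\Gamma_{\alpha,n},\Gamma_{\beta,n})\right| \leq C(\alpha,\beta,d,t)\,\frac{\sqrt{\tau}}{\sqrt{n}}.$$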
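The matrix constructions of this section translate almost line by line into code. The sketch below is illustrative only: `kernel` is any vectorized function standing in for $k_\alpha$, the function names are hypothetical, and the factor `n` in the pointwise distance is the normalization that makes the row sum approximate an integral against $\mu$ (an assumption of this sketch).

```python
import numpy as np

def empirical_diffusion_matrix(samples, kernel):
    """Build D^{-1/2} K D^{-1/2} with K[i, j] = kernel(x_i, x_j) / n."""
    n = len(samples)
    k = kernel(samples[:, None, :], samples[None, :, :]) / n
    d = k.sum(axis=1)                      # diagonal of the degree matrix
    return k / np.sqrt(np.outer(d, d))

def empirical_pointwise_sq(a_alpha, a_beta, t, i, j):
    """n * sum_k (A_alpha^t[i, k] - A_beta^t[j, k])^2."""
    n = a_alpha.shape[0]
    row_alpha = np.linalg.matrix_power(a_alpha, t)[i]
    row_beta = np.linalg.matrix_power(a_beta, t)[j]
    return n * np.sum((row_alpha - row_beta) ** 2)

def empirical_global_sq(a_alpha, a_beta, t):
    """Squared Hilbert-Schmidt norm of A_alpha^t - A_beta^t."""
    diff = np.linalg.matrix_power(a_alpha, t) - np.linalg.matrix_power(a_beta, t)
    return np.sum(diff**2)
```

With these normalizations, averaging `empirical_pointwise_sq(a_alpha, a_beta, t, i, i)` over `i` reproduces `empirical_global_sq`, mirroring the Tonelli identity of Section 4 with $\mu$ replaced by the empirical measure.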

6. Applications

6.1. Change detection in hyperspectral imagery data 

In this section we consider the problem of change detection in hyperspectral imagery (HSI) data. The main ideas are the following. A hyperspectral image can be thought of as a data cube $C$, with dimensions $L \times W \times D$. The cube $C$ corresponds to an image whose pixel dimensions are $L \times W$. A hyperspectral camera measures the reflectance of this image at $D$ different wavelengths, giving one $D$ images, which, put together, give one the cube $C$. Thus we think of a hyperspectral image as a regular image in which each pixel now has a spectral signature in $\mathbb{R}^D$.

The change detection problem is the following. Suppose you have one scene for which you 
have several hyperspectral images taken at different times. These images can be taken under 
different weather conditions, lighting conditions, during different seasons of the year, and even 
with different cameras. The goal is to determine what has changed from one image to the next. 

To test the diffusion distance in this setting, we used some of the data collected in [1]. Using a hyperspectral camera that captured 124 different wavelengths, the authors of [1] collected hyperspectral images of a particular scene during August, September, October, and November (one image for each month). In October, they also recorded a fifth image in which they introduced two small changes into the scene as a means for testing change detection algorithms. For our purposes, we selected a particular $100 \times 100 \times 124$ sub-cube across all five images that contains one of the aforementioned introduced changes. Color images of the four months are given in figure 1, while the fifth image taken in October and containing the change is given in figure 2. In all five images one can see in the foreground grass and in the background a tree line, with a metal panel resting on the grass. In the additional fifth image, there is also a small tarp sitting on the grass. The images were obviously taken during different times of the year, ranging from Summer to Fall, and it is also evident that the lighting is different from image to image. One can see these changes in how the spectral signature of a particular pixel changes from month to month; see figure 3(a) for an example.

Figure 1: Color images of the four months. Panels: (a) August, (b) September, (c) October, (d) November.

Figure 2: Color image of the changed scene (taken in October).

The authors did use the same camera for each image though, so in order to simulate the use of different cameras we did the following. For each of the five images, we randomly selected 50 of the 124 bands to use; we also randomly reordered each set of 50 bands. To see an example of these new spectra, we refer the reader to figure 3(b). While the seasonal and lighting changes affected the spectra, it is clear from figure 3(a) that they are still at least intuitively comparable since the same camera was used. Using the five random "cameras," however, makes a direct comparison between months now much more difficult.

We set the parameter space as $I = \{\mathrm{aug}, \mathrm{sep}, \mathrm{oct}, \mathrm{nov}, \mathrm{chg}\}$, where chg denotes the data set with the change in it. We also set $I^{(4)} = \{\mathrm{aug}, \mathrm{sep}, \mathrm{oct}, \mathrm{nov}\} \subset I$. For each $\alpha \in I$, we let $X_\alpha$ denote the corresponding $100 \times 100 \times 50$ hyperspectral image taken with our "random" camera (derived from the original hyperspectral images as described above). For each month as well as the changed data set, we computed a Gaussian kernel of the form:

$$k_\alpha(x, y) = e^{-\|x-y\|^2/\varepsilon(\alpha)^2}, \quad \text{for all } \alpha \in I,\ x,y \in X_\alpha,$$

where $\varepsilon(\alpha)$ was selected so that the corresponding symmetric diffusion operator (matrix) $A_\alpha$ would have second eigenvalue $\lambda_\alpha^{(2)} \approx 0.97$. By forcing each diffusion operator to have approximately the same second eigenvalue, the five diffusion processes will spread at approximately the same rate. We then computed the diffusion distance between a pixel $x$ taken from $X_{\mathrm{chg}}$ and its corresponding pixel in $X_\alpha$ for each $\alpha \in I^{(4)}$, i.e., we computed $D^{(t)}(x_{\mathrm{chg}}, x_\alpha)$. The results for $t = 1$ are given in figure 4, while the asymptotic diffusion distance as $t \to \infty$ is given in figure 5.

Figure 4: Map of $D^{(1)}(x_{\mathrm{chg}}, x_\alpha)$ for each $\alpha \in I^{(4)}$. Panels: (a) August, (b) September, (c) October, (d) November.
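The text does not say how the kernel scale was tuned to a target second eigenvalue. One simple possibility, shown here purely as an assumption, is bisection on $\log\varepsilon$ for a Gaussian kernel of the form $e^{-\|x-y\|^2/\varepsilon^2}$; the function names and the bracketing interval are hypothetical.

```python
import numpy as np

def second_eigenvalue(points, eps):
    """Second largest eigenvalue of the symmetric diffusion matrix."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / eps**2)
    m = k.sum(axis=1)
    a = k / np.sqrt(np.outer(m, m))
    return np.linalg.eigvalsh(a)[-2]      # eigvalsh sorts in ascending order

def select_eps(points, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection on log(eps).

    Assumes the bracket satisfies
    second_eigenvalue(lo) > target > second_eigenvalue(hi):
    a tiny eps gives a nearly diagonal kernel (eigenvalue near 1),
    a huge eps gives a nearly constant kernel (eigenvalue near 0).
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)            # geometric midpoint
        if second_eigenvalue(points, mid) > target:
            lo = mid                      # kernel too local, widen it
        else:
            hi = mid
    return np.sqrt(lo * hi)
```

For image-sized data one would replace the dense eigendecomposition with a sparse or iterative solver; the dense version is kept here only for clarity.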

For the diffusion time $t = 1$, we see from the maps in figure 4 that the tarp is recognized as a change. However, other changes due to the lighting or the change in seasons also appear. For example, even in October, the small change in the shadow is visible, while in August, September, and November the change in lighting causes the panel to be highlighted. Also, in some months even the trees have a weak, but noticeable difference in their diffusion distances. When we allow $t \to \infty$ though, the smaller clusters merge together and the changes due to lighting and seasonal differences are filtered out. As one can see from figure 5, all that is left is the change due to the added tarp (note that the change around the border of the panel is due to it being slightly shifted from month to month). Thus we see in practice that the diffusion distance can be used to filter types of changes at different scales.

Figure 5: Map of $\lim_{t\to\infty} D^{(t)}(x_{\mathrm{chg}}, x_\alpha)$ for each $\alpha \in I^{(4)}$. Panels: (a) August, (b) September, (c) October, (d) November.

Figure 6: Global diffusion distance. Red: $\mathcal{D}^{(t)}(\Gamma_{\mathrm{chg}},\Gamma_{\mathrm{aug}})$, green: $\mathcal{D}^{(t)}(\Gamma_{\mathrm{chg}},\Gamma_{\mathrm{sep}})$, blue: $\mathcal{D}^{(t)}(\Gamma_{\mathrm{chg}},\Gamma_{\mathrm{oct}})$, black: $\mathcal{D}^{(t)}(\Gamma_{\mathrm{chg}},\Gamma_{\mathrm{nov}})$.

We also computed the global diffusion distances between the changed data set and the four months. The results are given in figure 6. We see several intuitions borne out in this particular application. First, the closer the month in real time to October (the month in which the changed data set was recorded), the smaller the global diffusion distance. Secondly, we see that as the diffusion time $t$ gets larger, the global diffusion distance gets smaller.

6.2. Parameterized difference equations 

Suppose one has the following example developed by Mezic and Lederman of a dynamical system that depends on a certain input parameter. We take the data set $X_\alpha$ to be the set of orbits for the parameter $\alpha$. If one can define a suitable set of kernels $\{k_\alpha\}_{\alpha\in I}$ on the orbits, then using the diffusion distance it is possible to not only organize the behavior of the system for a fixed parameter, but to also see how the dynamics of the system change as the parameter is changed.

Consider the standard map, which is an area preserving chaotic map from the torus $\mathbb{T}^2 = 2\pi(S^1 \times S^1)$ onto itself. It is defined as:

$$p_{\ell+1} = p_\ell + \alpha\sin(\theta_\ell),$$
$$\theta_{\ell+1} = \theta_\ell + p_{\ell+1},$$

where $\alpha \in I = [0, \infty)$ is a parameter, $\ell \in \mathbb{N} \cup \{0\}$, and $(p_\ell, \theta_\ell) \in \mathbb{T}^2$. When $\alpha = 0$, the map consists solely of periodic and quasiperiodic orbits. For $\alpha > 0$, the map is increasingly nonlinear as $\alpha$ grows, which in turn increases the chances of observing chaotic dynamics.
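Orbits of the standard map are straightforward to generate. The sketch below iterates the two update equations with both coordinates reduced modulo $2\pi$; the function name, the orbit length, and the initial conditions are illustrative choices and are not taken from the text.

```python
import numpy as np

def standard_map_orbit(p0, theta0, alpha, n_steps):
    """Iterate p <- p + alpha*sin(theta), theta <- theta + p on the torus.

    Returns an array of shape (n_steps + 1, 2) whose rows are (p, theta).
    """
    two_pi = 2 * np.pi
    orbit = np.empty((n_steps + 1, 2))
    orbit[0] = (p0 % two_pi, theta0 % two_pi)
    for step in range(n_steps):
        p, theta = orbit[step]
        p_next = (p + alpha * np.sin(theta)) % two_pi
        theta_next = (theta + p_next) % two_pi   # uses the updated momentum
        orbit[step + 1] = (p_next, theta_next)
    return orbit
```

For `alpha = 0` the momentum `p` is constant along the orbit and `theta` simply rotates by `p` at each step, which matches the periodic and quasiperiodic behavior described above.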

Using the ideas developed in [17, 18], it is possible to define a kernel $k_\alpha$ on the orbits of the standard map with parameter $\alpha$. One can in turn use this kernel to define a diffusion map on these orbits; an example for a small $\alpha$ is given in figure 7. Using the ideas contained in Section 3, it is also possible to embed each diffusion map, for all $\alpha \in I$, into a single embedding. Doing so allows one to observe how the dynamics of the system change as the parameter is increased; see figure 8 for more details. In the forthcoming paper [19], we give a full treatment of these ideas.

Figure 7: Diffusion map of the orbits of the standard map for a small $\alpha$. The color of the embedded point on the left corresponds to the orbit of the same color on the right. A particular embedded point and orbit are highlighted in purple.

Figure 8: Common diffusion embedding of the orbits of the standard map across several values of $\alpha$. The color of the embedded point indicates the value of $\alpha$ used in the standard map. Notice, in particular, that many of the periodic and quasiperiodic orbits for low values of $\alpha$ that are embedded into the central ring of the embedding turn into chaotic orbits for higher values of $\alpha$. This in turn is realized by the diffusion map as the embedding has less structure.

6.3. Global embeddings

Another application of the global diffusion distance is that it gives a metric by which to compute a "graph of graphs". By this we mean the following: given our family of graphs $\{\Gamma_\alpha\}_{\alpha\in I}$, we can compute a new graph $\Omega_t = (\{\Gamma_\alpha\}_{\alpha\in I}, k_t)$, in which the vertices of $\Omega_t$ are the graphs $\{\Gamma_\alpha\}_{\alpha\in I}$ and the kernel $k_t : \{\Gamma_\alpha\}_{\alpha\in I} \times \{\Gamma_\alpha\}_{\alpha\in I} \to \mathbb{R}$ is a function of the global diffusion distance $\mathcal{D}^{(t)}$. One natural way to define $k_t$ is via Gaussian weights:

$$k_t(\Gamma_\alpha, \Gamma_\beta) \triangleq e^{-\mathcal{D}^{(t)}(\Gamma_\alpha,\Gamma_\beta)/\varepsilon}, \quad \text{for all } \alpha, \beta \in I. \tag{23}$$

Note that for each diffusion time $t$, we have a different kernel $k_t$, which results in a different graph $\Omega_t$. Fixing a specific, but arbitrary diffusion time $t$, one can in turn construct a new diffusion operator on the graph $\Omega_t$ by using $k_t$ as the underlying kernel. For example, if $I$ is finite and we let $m_t : \{\Gamma_\alpha\}_{\alpha\in I} \to \mathbb{R}$ be the density of $k_t$, where

$$m_t(\Gamma_\alpha) = \sum_{\beta\in I} k_t(\Gamma_\alpha, \Gamma_\beta),$$

then the corresponding symmetric diffusion kernel $a_t : \{\Gamma_\alpha\}_{\alpha\in I} \times \{\Gamma_\alpha\}_{\alpha\in I} \to \mathbb{R}$ would be defined as

$$a_t(\Gamma_\alpha, \Gamma_\beta) = \frac{k_t(\Gamma_\alpha,\Gamma_\beta)}{\sqrt{m_t(\Gamma_\alpha)}\,\sqrt{m_t(\Gamma_\beta)}}.$$

Since we are assuming $I$ is finite, one can think of $a_t$ as an $|I| \times |I|$ matrix, and one can compute the eigenvectors and eigenvalues of $a_t$. The standard diffusion embedding can then be used to cluster the family of graphs $\{\Gamma_\alpha\}_{\alpha\in I}$. Keep in mind that this second diffusion embedding has its own diffusion time, completely separate from the diffusion time $t$ on the individual graphs $\{\Gamma_\alpha\}_{\alpha\in I}$. Finally, if $I$ is infinite, then one must define an appropriate measure on the family of graphs $\{\Gamma_\alpha\}_{\alpha\in I}$ or on the parameter space $I$.
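For a finite parameter set, the graph of graphs can be assembled from the matrix of pairwise global diffusion distances. The sketch below is one possible arrangement, with hypothetical function names; it assumes the kernel weight $e^{-\mathcal{D}^{(t)}/\varepsilon}$ and an integer diffusion time `s` on the meta graph (a fractional `s` would additionally require the retained eigenvalues to be nonnegative).

```python
import numpy as np

def pairwise_global_distances(a_list, t):
    """Matrix of Frobenius distances between the t-th powers of the a_alpha."""
    powers = [np.linalg.matrix_power(a, t) for a in a_list]
    n_graphs = len(powers)
    dist = np.zeros((n_graphs, n_graphs))
    for i in range(n_graphs):
        for j in range(i + 1, n_graphs):
            dist[i, j] = dist[j, i] = np.linalg.norm(powers[i] - powers[j])
    return dist

def graph_of_graphs_embedding(dist, eps, n_coords, s):
    """Diffusion map of the meta graph whose vertices are the graphs."""
    k = np.exp(-dist / eps)                 # kernel on the family of graphs
    m = k.sum(axis=1)                       # density
    a = k / np.sqrt(np.outer(m, m))         # symmetric diffusion kernel
    lam, psi = np.linalg.eigh(a)
    order = np.argsort(lam)[::-1][:n_coords]
    return psi[:, order] * lam[order] ** s, lam[order]
```

A call such as `graph_of_graphs_embedding(dist, np.median(dist[np.triu_indices(len(dist), k=1)]), 3, 2)` uses the median pairwise distance as the scale, in the spirit of the torus example that follows.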

We apply this idea to the following example. Our initial data set $X$ is a torus with a central radius of six and a lateral radius of two; more formally, $X = (6S^1) \times (2S^1)$, where $S^1$ is the unit circle in $\mathbb{R}^2$. An image of $X$ is given in figure 9(a). We assume that the central circle $6S^1$ and the lateral circle $2S^1$ are oriented, so that each point on the torus has a specific coordinate location (note that while $X \subset \mathbb{R}^3$, the points of the torus have a two dimensional coordinate system consisting of two angles, one for the central circle and one for the lateral circle).

From $X$ we build a family of "pinched" tori as follows. We pick an angle on the central circle $6S^1$, say $\theta_0$, and we pinch the torus at $\theta_0$ so that its lateral radius at this angle is now $r_0$, where $r_0 < 2$. So that we do not rip the torus, from a starting angle $\theta_s$, the lateral radius will decrease linearly from 2 at $\theta_s$ to $r_0$ at $\theta_0$, and then increase linearly from $r_0$ at $\theta_0$ back to 2 at some ending angle $\theta_e$. The lateral radius of this new torus will be 2 at all other angles on the central circle. For an example see figure 9(b).

We create several pinched tori as follows. We take three different angles to pinch the torus at: $\theta_0 = \pi/2$, $\pi$, and $3\pi/2$. At each of these three angles, we pinch the torus so that the lateral radius $r_0$ at $\theta_0$ can take one of ten values: $r_0 = 1, 1.1, 1.2, \ldots, 1.9$. The starting and ending angles for each pinch are offset from $\theta_0$ by $\pi/4$ radians, so that $\theta_s = \theta_0 - \pi/4$ and $\theta_e = \theta_0 + \pi/4$. Thus we have 30 different pinched tori, which along with the original torus, gives us a family of 31 tori.
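The pinched torus family can be generated with a piecewise linear lateral radius. The sketch below makes two assumptions that the text does not spell out: the torus is embedded in $\mathbb{R}^3$ with the usual angle parametrization, and samples are drawn uniformly in the two angles. All function names are illustrative.

```python
import numpy as np

def lateral_radius(theta, theta0, r0, width=np.pi / 4, base=2.0):
    """Lateral radius at central angle theta for a pinch of strength r0 at theta0."""
    # Signed angular offset from the pinch, wrapped to [-pi, pi).
    offset = (theta - theta0 + np.pi) % (2 * np.pi) - np.pi
    # 0 at the pinch, rising linearly to 1 at theta0 +/- width, 1 elsewhere.
    frac = np.clip(np.abs(offset) / width, 0.0, 1.0)
    return r0 + (base - r0) * frac

def pinched_torus(theta, phi, theta0, r0, central=6.0):
    """Map central angle theta and lateral angle phi to points in R^3."""
    r = lateral_radius(theta, theta0, r0)
    x = (central + r * np.cos(phi)) * np.cos(theta)
    y = (central + r * np.cos(phi)) * np.sin(theta)
    z = r * np.sin(phi)
    return np.stack([x, y, z], axis=-1)

# The regular torus (r0 = 2) plus 3 pinch angles times 10 pinch strengths.
family = [(0.0, 2.0)] + [
    (theta0, r0)
    for theta0 in (np.pi / 2, np.pi, 3 * np.pi / 2)
    for r0 in 1.0 + 0.1 * np.arange(10)
]

# The same angle samples are reused for every torus, so that points correspond.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi, size=100)
phi = rng.uniform(0.0, 2 * np.pi, size=100)
clouds = [pinched_torus(theta, phi, theta0, r0) for theta0, r0 in family]
```

Reusing the same `(theta, phi)` samples across the family is what provides the bijective correspondence between the vertex sets of the different graphs.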

Our goal is to build a graph in which each vertex is one of the 31 tori. To do so we approximate the global diffusion distance between each pair of tori by taking 7744 random samples from $X$ (using the uniform distribution), and then using the same corresponding samples for each pinched torus. For each torus we used a Gaussian kernel of the form

$$k_\alpha(x, y) = e^{-\|x-y\|^2/\varepsilon(\alpha)^2}, \quad \text{for all } \alpha = (\theta_0, r_0),$$

where $\varepsilon(\alpha)$ was selected so that the corresponding symmetric diffusion operator (matrix) $A_\alpha$ would have second eigenvalue $\lambda_\alpha^{(2)} \approx 0.5$. The pairwise global diffusion distance was further approximated by taking the top ten eigenvalues and eigenvectors of each of the 31 diffusion operators, and was then computed for diffusion time $t = 2$ using Theorem 4.1. Two remarks: first, the diffusion time $t = 2 = 1/(1 - \lambda_\alpha^{(2)})$ corresponds to the approximate time it would take for the diffusion process to spread through each of the graphs; secondly, by Theorem 5.2, this approximate global diffusion distance is, with high probability, nearly equal to the true global diffusion distance between each of the tori.

Figure 9: Regular and pinched tori. Panels: (a) Regular torus, (b) Pinched torus.

After computing the pairwise global diffusion distances, we constructed the kernel $k_t$, for $t = 2$, defined in equation (23). We took $\varepsilon$ in this kernel to be the median of all pairwise global diffusion distances between the 31 tori. We then computed the symmetric diffusion operator for this graph of graphs, which turned out to have second eigenvalue $\lambda^{(2)} \approx 0.48$. We took the top three eigenvalues and eigenvectors of the diffusion operator, and used them to compute the diffusion map into $\mathbb{R}^3$ at diffusion time $t' \approx 1/(1 - 0.48) \approx 1.92$.

A plot of this diffusion map is given in figure 10. The central, dark blue, circle corresponds to the regular torus in both images. In figure 10(a), the other three colors correspond to the angle at which the torus was pinched. In figure 10(b), the colors correspond to the strength of the pinch (dark blue: no pinch, dark red: strongest pinch). As one can see, the diffusion embedding organizes the tori by both the location of the pinch (i.e. what arc the embedded torus lies on), and the strength of the pinch (i.e. how far from the regular torus each pinched torus lies), giving a global view of how the data set changes over the parameter space.

Figure 10: Diffusion embedding of the 31 tori. Panels: (a) Colored by location of pinch, (b) Colored by strength of pinch.

7. Conclusion

In this paper we have generalized the diffusion distance to work on a changing graph. This new distance, along with the corresponding diffusion maps, allows one to understand how the intrinsic geometry of the data set changes over the parameter space. We have also defined a global diffusion distance between graphs, and used this to construct meta graphs in which each vertex of the meta graph corresponds to a graph. Formulas for each of these diffusion distances in terms of the spectral decompositions of the relevant diffusion operators have been proven, giving a simple and efficient way to approximate these diffusion distances. Finally, it was shown that a random, finite sample of data points from a continuous, changing data set $X$ is, with high probability, enough to approximate the diffusion distance and the global diffusion distance to high accuracy.

Future work could include generalizing these notions of diffusion distance further so that they can apply to sequences of graphs in which vertices are added and dropped (i.e. in which there is no bijective correspondence between graphs). Also, it would be interesting to investigate how this work fits in with the recent research on vectorized diffusion operators contained in [20, 21].

Appendix A. Non-bijective correspondence

In this appendix we consider the case in which our changing data set does not have a single bijective correspondence across the parameter set $I$. We make a few small changes to the notation. Continue to let $I$ denote the parameter space, but let $(X,\mu)$ denote a "global" measure space (with $X \subset \mathbb{R}^d$). Our changing data is given by $\{X_\alpha\}_{\alpha\in I}$ with data points $x_\alpha \in X_\alpha$, and satisfies

$$X_\alpha \subset X, \quad \text{for all } \alpha \in I.$$

We assume that each data set $X_\alpha$ is a measurable set under $\mu$. Suppose, additionally, that there exists a sufficiently large set $S \subset X$ such that

$$S \subset X_\alpha, \quad \text{for all } \alpha \in I.$$

We maintain the remaining notations and assumptions from Section 3, and simply update them to apply for each $X_\alpha$. In particular, for each $\alpha \in I$, we have the symmetric diffusion kernel $a_\alpha : X_\alpha \times X_\alpha \to \mathbb{R}$, with corresponding trace class operator $A_\alpha : L^2(X_\alpha,\mu) \to L^2(X_\alpha,\mu)$. The set of functions $\{\psi_\alpha^{(i)}\}_{i\geq 1} \subset L^2(X_\alpha,\mu)$ still denotes a set of orthonormal eigenfunctions for $A_\alpha$, with corresponding eigenvalues $\{\lambda_\alpha^{(i)}\}_{i\geq 1}$. The diffusion map is still given by $\Psi_\alpha^{(t)} : X_\alpha \to \ell^2$, with

$$\Psi_\alpha^{(t)}(x_\alpha) = \left((\lambda_\alpha^{(i)})^t\,\psi_\alpha^{(i)}(x_\alpha)\right)_{i\geq 1}.$$

Under this more general setup, for any $\alpha, \beta \in I$, the sets $X_\alpha \setminus X_\beta$ and $X_\beta \setminus X_\alpha$ may be nonempty. Thus it is not possible to compare the diffusions on $\Gamma_\alpha$ and $\Gamma_\beta$ as they spread through each graph. On the other hand, since we have a common set $S \subset X_\alpha \cap X_\beta$, we can compare the diffusion centered at $x_\alpha \in X_\alpha$ with the diffusion centered at $y_\beta \in X_\beta$ as they spread through the subgraphs of $\Gamma_\alpha$ and $\Gamma_\beta$ with common vertices $S$. Formally, we define this diffusion distance as:

$$D^{(t)}(x_\alpha,y_\beta; S)^2 \triangleq \int_S \left(a_\alpha^{(t)}(x_\alpha, s) - a_\beta^{(t)}(y_\beta, s)\right)^2 d\mu(s), \quad \text{for all } \alpha, \beta \in I,\ (x_\alpha,y_\beta) \in X_\alpha \times X_\beta.$$

A result similar to Theorem 3.4 can be had for this subgraph diffusion distance. Since the eigenfunctions for $A_\alpha$ will not be orthonormal when restricted to $L^2(S,\mu)$, one must use an additional orthonormal basis $\{e^{(i)}\}_{i\geq 1}$ for $L^2(S,\mu)$ when rotating the diffusion maps across $I$ into a common embedding. In particular, we define a new family of rotation maps $O_{\alpha,S} : \ell^2 \to \ell^2$ as:

$$O_{\alpha,S}\,v \triangleq \left(\sum_{j\geq 1} v[j]\,\langle e^{(i)}, \psi_\alpha^{(j)}\rangle_{L^2(S,\mu)}\right)_{i\geq 1}.$$

Using these rotation maps, along with the same ideas from Section 3, one can show:

$$D^{(t)}(x_\alpha,y_\beta;S) = \|O_{\alpha,S}\Psi_\alpha^{(t)}(x_\alpha) - O_{\beta,S}\Psi_\beta^{(t)}(y_\beta)\|_{\ell^2},$$

with convergence in $L^2(X_\alpha \times X_\beta, \mu \otimes \mu)$.
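In the finite case, the subgraph diffusion distance only requires knowing where the common vertices $S$ sit in each graph. A minimal sketch, assuming counting measure and hypothetical index arrays `s_in_alpha` and `s_in_beta` that list the same points of $S$ in the same order:

```python
import numpy as np

def subgraph_diffusion_distance(a_alpha, a_beta, t, i, j, s_in_alpha, s_in_beta):
    """Distance between vertex i of graph alpha and vertex j of graph beta.

    The t-step diffusions are compared only on the common vertex set S;
    s_in_alpha[k] and s_in_beta[k] must refer to the same point of S.
    The two matrices may have different sizes.
    """
    row_alpha = np.linalg.matrix_power(a_alpha, t)[i, s_in_alpha]
    row_beta = np.linalg.matrix_power(a_beta, t)[j, s_in_beta]
    return np.linalg.norm(row_alpha - row_beta)
```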



Appendix B. Proof of random sampling theorems 



In this appendix we prove the random sampling Theorems 5.1 and 5.2 from section 5. Throughout the appendix we shall assume that $(X,\mu)$ and $\{k_\alpha\}_{\alpha\in I}$ satisfy Assumption 2.

The proof shall rely upon a result from [14] as well as several results on the asymmetric graph Laplacian $I - P$ that are contained in [15]. All of these results are easily translated for our family of operators $\{A_\alpha\}_{\alpha\in I}$, and we shall simply restate the needed results from [15] in these terms.



Appendix B.1. Reproducing kernel Hilbert spaces

Critical to our analysis will be the existence of a single reproducing kernel Hilbert space (RKHS) that contains the set of kernels $\{a_\alpha\}_{\alpha\in I}$, their empirical approximations, and related functions. In [15] such a RKHS is constructed. Here we recall the definition of a RKHS as well as the aforementioned construction.

A set $\mathcal{H}$ is a RKHS if it is a Hilbert space of functions $f : X \to \mathbb{R}$ such that for each $x \in X$, there exists a constant $C(x)$ so that

$$|f(x)| \leq C(x)\,\|f\|_{\mathcal{H}}.$$

The name RKHS comes from the fact that one can show that there is a unique symmetric, positive definite kernel $h : X \times X \to \mathbb{R}$ associated with $\mathcal{H}$ such that for each $f \in \mathcal{H}$,

$$f(x) = \langle f, h(x, \cdot)\rangle_{\mathcal{H}}, \quad \text{for all } x \in X.$$

We utilize a specific RKHS first presented in [15]; the construction is rewritten here for completeness. Let $l$ be a positive integer, and define the Sobolev space $\mathcal{H}^l$ as

$$\mathcal{H}^l = \{f \in L^2(X, dx) : D^\gamma f \in L^2(X, dx) \text{ for all } |\gamma| = l\},$$

where $D^\gamma f$ is the weak derivative of $f$ with respect to the multi-index $\gamma = (\gamma_1, \ldots, \gamma_d) \in \mathbb{N}^d$, $|\gamma| = \gamma_1 + \cdots + \gamma_d$, and $dx$ denotes the Lebesgue measure. The space $\mathcal{H}^l$ is a separable Hilbert space with scalar product

$$\langle f, g\rangle_{\mathcal{H}^l} = \langle f, g\rangle_{L^2(X,dx)} + \sum_{|\gamma| = l}\langle D^\gamma f, D^\gamma g\rangle_{L^2(X,dx)}.$$

Also note that the space $C_b^l(X)$ is a Banach space with respect to the norm

$$\|f\|_{C_b^l(X)} = \sup_{x\in X}|f(x)| + \sum_{|\gamma| = l}\sup_{x\in X}|(D^\gamma f)(x)|.$$

As explained in [15], since $X$ is bounded, we have $C_b^l(X) \subset \mathcal{H}^l$ and $\|f\|_{\mathcal{H}^l} \leq C(l)\,\|f\|_{C_b^l(X)}$. Via Corollary 21 of section 4.6 from [16], if $m \in \mathbb{N}$ and $l - m > d/2$, then we also have:

$$\mathcal{H}^l \subset C_b^m(X) \quad \text{and} \quad \|f\|_{C_b^m(X)} \leq C(l,m)\,\|f\|_{\mathcal{H}^l}. \tag{B.1}$$

Following [15], if one takes $s = \lfloor d/2 \rfloor + 1$, then using (B.1) with $l = s$ and $m = 0$ we see that $\mathcal{H}^s$ is a RKHS with a continuous, real valued, bounded kernel $h_s$.

Appendix B.2. Additional operators 

In this section we define several operators that will bridge the gap between the matrix $\mathbf{A}_\alpha$
and the operator $A_\alpha$. All of these definitions are based on those from [15] for the asymmetrical
diffusion operators (i.e., $P$). To start, define the empirical density maps $m_{\alpha,n} : X \to \mathbb{R}$ in terms of
the samples $X_n = \{x^{(1)}, \ldots, x^{(n)}\}$ as

$$m_{\alpha,n}(x) = \frac{1}{n} \sum_{i=1}^{n} k_\alpha(x, x^{(i)}), \quad \text{for all } \alpha \in \mathcal{I},\ x \in X.$$

Note that $m_{\alpha,n}(x^{(i)}) = \mathbf{D}_\alpha[i,i]$. We also define the empirical kernels $a_{\alpha,n} : X \times X \to \mathbb{R}$ as

$$a_{\alpha,n}(x,y) = \frac{k_\alpha(x,y)}{\sqrt{m_{\alpha,n}(x)}\,\sqrt{m_{\alpha,n}(y)}}, \quad \text{for all } \alpha \in \mathcal{I},\ x, y \in X.$$

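We then have the following lemma from [15], adapted for symmetric diffusion operators.

Lemma Appendix B.1 (Lemma 16 from [15]). Assume that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the
conditions of Assumption 2. Then, for all $\alpha \in \mathcal{I}$ and for all $x \in X$,

$$k_\alpha(x,\cdot),\ m_\alpha,\ m_{\alpha,n},\ \frac{1}{m_\alpha},\ \frac{1}{m_{\alpha,n}} \in C_b^{d+1}(X) \subset \mathcal{H}^{d+1} \subset \mathcal{H}^s,$$

$$\|k_\alpha(x,\cdot)\|_{C_b^{d+1}(X)},\ \|m_\alpha\|_{C_b^{d+1}(X)},\ \|m_{\alpha,n}\|_{C_b^{d+1}(X)},\ \Big\|\frac{1}{m_\alpha}\Big\|_{C_b^{d+1}(X)},\ \Big\|\frac{1}{m_{\alpha,n}}\Big\|_{C_b^{d+1}(X)} \le C(\alpha,d),$$

$$a_\alpha(x,\cdot),\ a_{\alpha,n}(x,\cdot) \in C_b^{d+1}(X) \subset \mathcal{H}^{d+1} \subset \mathcal{H}^s,$$

$$\|a_\alpha(x,\cdot)\|_{\mathcal{H}^s},\ \|a_{\alpha,n}(x,\cdot)\|_{\mathcal{H}^s} \le C(\alpha,d).$$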



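Lemma Appendix B.1 allows one to define the operators $A_{\alpha,\mathcal{H}^s} : \mathcal{H}^s \to \mathcal{H}^s$ and $A_{\alpha,n} : \mathcal{H}^s \to \mathcal{H}^s$,

$$(A_{\alpha,\mathcal{H}^s} f)(x) \triangleq \int_X a_\alpha(x,y)\,\langle f, h_s(y,\cdot) \rangle_{\mathcal{H}^s}\, d\mu(y), \quad \text{for all } \alpha \in \mathcal{I},\ f \in \mathcal{H}^s,$$

$$(A_{\alpha,n} f)(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} a_{\alpha,n}(x, x^{(i)})\,\langle f, h_s(x^{(i)},\cdot) \rangle_{\mathcal{H}^s}, \quad \text{for all } \alpha \in \mathcal{I},\ f \in \mathcal{H}^s.$$

We also define similar operators $T_{\mathcal{H}^s} : \mathcal{H}^s \to \mathcal{H}^s$ and $T_n : \mathcal{H}^s \to \mathcal{H}^s$, but in terms of the
reproducing kernel $h_s$:

$$(T_{\mathcal{H}^s} f)(x) \triangleq \int_X h_s(x,y)\,\langle f, h_s(y,\cdot) \rangle_{\mathcal{H}^s}\, d\mu(y), \quad \text{for all } f \in \mathcal{H}^s,$$

$$(T_n f)(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} h_s(x, x^{(i)})\,\langle f, h_s(x^{(i)},\cdot) \rangle_{\mathcal{H}^s}, \quad \text{for all } f \in \mathcal{H}^s.$$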

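The above operators, as well as $\mathbf{A}_\alpha$ and $A_\alpha$, can be decomposed in terms of the appropriate
restriction and extension operators. We begin with the two restriction operators, $R_{\mathcal{H}^s} : \mathcal{H}^s \to L^2(X,\mu)$ and $R_n : \mathcal{H}^s \to \mathbb{R}^n$:

$$(R_{\mathcal{H}^s} f)(x) \triangleq \langle f, h_s(x,\cdot) \rangle_{\mathcal{H}^s}, \quad \text{for a.e. } x \in X, \text{ for all } f \in \mathcal{H}^s,$$

$$R_n f \triangleq (f(x^{(1)}), \ldots, f(x^{(n)})), \quad \text{for all } f \in \mathcal{H}^s.$$

For each $\alpha \in \mathcal{I}$ we also have two extension operators, $E_{\alpha,\mathcal{H}^s} : L^2(X,\mu) \to \mathcal{H}^s$ and $E_{\alpha,n} : \mathbb{R}^n \to \mathcal{H}^s$, where

$$(E_{\alpha,\mathcal{H}^s} f)(x) \triangleq \int_X a_\alpha(x,y)\, f(y)\, d\mu(y), \quad \text{for all } x \in X,\ f \in L^2(X,\mu),$$

$$(E_{\alpha,n} v)(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} v[i]\, a_{\alpha,n}(x, x^{(i)}), \quad \text{for all } x \in X,\ v \in \mathbb{R}^n.$$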

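Using these operators, one can easily show the following identities:

$$A_\alpha = R_{\mathcal{H}^s} E_{\alpha,\mathcal{H}^s} \quad \text{and} \quad A_{\alpha,\mathcal{H}^s} = E_{\alpha,\mathcal{H}^s} R_{\mathcal{H}^s},$$
$$\mathbf{A}_\alpha = R_n E_{\alpha,n} \quad \text{and} \quad A_{\alpha,n} = E_{\alpha,n} R_n, \tag{B.2}$$
$$T_{\mathcal{H}^s} = R_{\mathcal{H}^s}^* R_{\mathcal{H}^s} \quad \text{and} \quad T_n = R_n^* R_n.$$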
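The discrete identity $\mathbf{A}_\alpha = R_n E_{\alpha,n}$ can be checked numerically. The following is a minimal sketch under stated assumptions: a Gaussian kernel stands in for $k_\alpha$, the measure is uniform on $[0,1]$, and the matrix is assumed to be normalized as $\mathbf{A}_\alpha = \mathbf{D}_\alpha^{-1/2} \mathbf{K}_\alpha \mathbf{D}_\alpha^{-1/2}$ with $\mathbf{K}_\alpha[i,j] = k_\alpha(x^{(i)}, x^{(j)})/n$. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
samples = rng.uniform(0.0, 1.0, size=n)   # X_n, i.i.d. from Uniform[0, 1]

def k(x, y, eps=0.2):
    """Hypothetical Gaussian kernel standing in for k_alpha."""
    x = np.atleast_1d(x).astype(float)
    y = np.atleast_1d(y).astype(float)
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / eps)

def m_n(x):
    """Empirical density m_{alpha,n}(x) = (1/n) sum_i k(x, x_i)."""
    return k(x, samples).mean(axis=1)

def a_n(x, y):
    """Empirical kernel k(x, y) / (sqrt(m_n(x)) sqrt(m_n(y)))."""
    return k(x, y) / np.sqrt(np.outer(m_n(x), m_n(y)))

def extend(v):
    """E_{alpha,n} v, returned as a function on X."""
    return lambda x: a_n(x, samples) @ v / n

def restrict(func):
    """R_n f = (f(x_1), ..., f(x_n))."""
    return func(samples)

# Matrix version (assumed normalization): K = k / n, D = diag(row sums).
K = k(samples, samples) / n
deg = K.sum(axis=1)
A = K / np.sqrt(np.outer(deg, deg))

v = rng.standard_normal(n)
identity_ok = np.allclose(restrict(extend(v)), A @ v)   # A = R_n E_{alpha,n}
density_ok = np.allclose(m_n(samples), deg)             # m_n(x_i) = D[i, i]
```

Under this normalization the entries satisfy $\mathbf{A}_\alpha[i,j] = a_{\alpha,n}(x^{(i)}, x^{(j)})/n$, which is why restricting the extension of a vector reproduces the matrix-vector product.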

Appendix B.3. Similarity between empirical and continuous operators 

Here we collect the remaining results that we shall need, which concern the similarity between the
empirical and continuous versions of the previously defined operators and functions. All of these
results can be found in [14, 15].

Theorem Appendix B.2 (03], also Theorem 7 from [15]). Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$
satisfy the conditions of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d.
according to $\mu$; also let $\tau > 0$. Then the operators $T_{\mathcal{H}^s}$ and $T_n$ are Hilbert-Schmidt, and with
probability $1 - 2e^{-\tau}$,

$$\|T_{\mathcal{H}^s} - T_n\|_{HS} \le C(d)\, \frac{\sqrt{\tau}}{\sqrt{n}}.$$

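Theorem Appendix B.3 (Theorem 15 from [15]). Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the
conditions of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to
$\mu$; also let $\tau > 0$ and $\alpha \in \mathcal{I}$. Then the operators $A_{\alpha,\mathcal{H}^s}$ and $A_{\alpha,n}$ are Hilbert-Schmidt, and with
probability $1 - 2e^{-\tau}$,

$$\|A_{\alpha,\mathcal{H}^s} - A_{\alpha,n}\|_{HS} \le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}}.$$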

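Lemma Appendix B.4 (Lemma 18 from [15]). Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the
conditions of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to $\mu$;
also let $\tau > 0$ and $\alpha \in \mathcal{I}$. Then, with probability $1 - 2e^{-\tau}$,

$$\|m_\alpha - m_{\alpha,n}\|_{\mathcal{H}^{d+1}} \le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}}.$$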
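The $1/\sqrt{n}$ rate in these results can be observed empirically in a simple case. The sketch below is only an illustration under stated assumptions: it uses a Gaussian kernel with a hypothetical bandwidth `EPS`, the uniform measure on $[0,1]$ (for which $m_\alpha$ has a closed form), and the sup norm over a grid in place of the Sobolev norm $\|\cdot\|_{\mathcal{H}^{d+1}}$ of Lemma Appendix B.4.

```python
import math
import numpy as np

EPS = 0.2  # hypothetical kernel bandwidth

def k(x, y):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / EPS)."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / EPS)

def m_exact(x):
    """m(x) = int_0^1 k(x, y) dy for mu = Uniform[0, 1], in closed form."""
    erf = np.vectorize(math.erf)
    s = math.sqrt(EPS)
    return 0.5 * math.sqrt(math.pi * EPS) * (erf((1.0 - x) / s) + erf(x / s))

def sup_error(n, rng, grid):
    """sup over the grid of |m_n - m| for one i.i.d. sample of size n."""
    samples = rng.uniform(0.0, 1.0, size=n)
    m_emp = k(grid, samples).mean(axis=1)
    return float(np.max(np.abs(m_emp - m_exact(grid))))

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 201)
sizes = [100, 1600, 6400]
# Average over 20 independent draws to smooth out sampling noise.
errors = [float(np.mean([sup_error(n, rng, grid) for _ in range(20)]))
          for n in sizes]
# If the error decays like 1 / sqrt(n), these rescaled values stay comparable.
scaled = [e * math.sqrt(n) for e, n in zip(errors, sizes)]
```

The rescaled errors remaining of the same order across sample sizes is consistent with, but of course does not prove, the stated rate.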



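Appendix B.4. Proof of Theorem 5.1

In this section we prove Theorem 5.1, which we restate here.

Theorem Appendix B.5 (Theorem 5.1). Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the conditions
of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to $\mu$; also let
$t \in \mathbb{N}$, $\tau > 0$, and $\alpha, \beta \in \mathcal{I}$. Then, with probability $1 - 2e^{-\tau}$,

$$\big| D^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)}) - D_n^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)}) \big| \le C(\alpha,\beta,d,t)\, \frac{\sqrt{\tau}}{\sqrt{n}}, \quad \text{for all } i,j = 1, \ldots, n.$$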



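Proof of Theorem 5.1. First an additional piece of notation. Recall the $d$-dimensional index $\gamma = (\gamma_1, \ldots, \gamma_d)$. Let $\partial_x^\gamma a_\alpha$ denote the $\gamma$-th partial derivative of $a_\alpha$ with respect to the variable $x$.
We begin with the empirical diffusion distance. Recall that $D_n^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)})^2 = n^2 \|\mathbf{A}_\alpha^t[i,\cdot] - \mathbf{A}_\beta^t[j,\cdot]\|_{\mathbb{R}^n}^2$. For each $i = 1, \ldots, n$, define the vector $e^{(i)} \in \mathbb{R}^n$ as

$$e^{(i)}[j] \triangleq \begin{cases} 1, & \text{if } j = i, \\ 0, & \text{otherwise}, \end{cases} \quad \text{for all } j = 1, \ldots, n.$$

We then have

\begin{align}
D_n^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)})^2 &= n^2 \|\mathbf{A}_\alpha^t e^{(i)} - \mathbf{A}_\beta^t e^{(j)}\|_{\mathbb{R}^n}^2 \notag \\
&= n^2 \langle \mathbf{A}_\alpha^t e^{(i)}, \mathbf{A}_\alpha^t e^{(i)} \rangle_{\mathbb{R}^n} + n^2 \langle \mathbf{A}_\beta^t e^{(j)}, \mathbf{A}_\beta^t e^{(j)} \rangle_{\mathbb{R}^n} - 2 n^2 \langle \mathbf{A}_\alpha^t e^{(i)}, \mathbf{A}_\beta^t e^{(j)} \rangle_{\mathbb{R}^n}. \tag{B.3}
\end{align}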
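The expansion (B.3) is straightforward to reproduce numerically. The following is a minimal sketch, assuming Gaussian kernels with two hypothetical bandwidths in place of $k_\alpha$ and $k_\beta$, the normalization $\mathbf{A} = \mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$ with $\mathbf{K}[i,j] = k(x^{(i)},x^{(j)})/n$, and an inner product on $\mathbb{R}^n$ weighted by $1/n$; each of these is an assumption made for illustration.

```python
import numpy as np

def diffusion_matrix(samples, eps):
    """A = D^{-1/2} K D^{-1/2} with K[i, j] = k(x_i, x_j) / n (assumed)."""
    n = len(samples)
    K = np.exp(-(samples[:, None] - samples[None, :]) ** 2 / eps) / n
    deg = K.sum(axis=1)
    return K / np.sqrt(np.outer(deg, deg))

def inner(u, v):
    """Assumed inner product on R^n: <u, v> = (1/n) sum_i u[i] v[i]."""
    return float(u @ v) / len(u)

def columns(A_a, A_b, i, j, t):
    """A_a^t e_i and A_b^t e_j (columns i and j of the matrix powers)."""
    u = np.linalg.matrix_power(A_a, t)[:, i]
    v = np.linalg.matrix_power(A_b, t)[:, j]
    return u, v

def empirical_distance_sq(A_a, A_b, i, j, t):
    """n^2 || A_a^t e_i - A_b^t e_j ||^2, the first line of (B.3)."""
    u, v = columns(A_a, A_b, i, j, t)
    return A_a.shape[0] ** 2 * inner(u - v, u - v)

def expanded_distance_sq(A_a, A_b, i, j, t):
    """The three inner products on the second line of (B.3)."""
    u, v = columns(A_a, A_b, i, j, t)
    return A_a.shape[0] ** 2 * (inner(u, u) + inner(v, v) - 2.0 * inner(u, v))

rng = np.random.default_rng(2)
samples = rng.uniform(0.0, 1.0, size=40)
A_alpha = diffusion_matrix(samples, eps=0.1)   # hypothetical parameter alpha
A_beta = diffusion_matrix(samples, eps=0.3)    # hypothetical parameter beta
lhs = empirical_distance_sq(A_alpha, A_beta, 3, 7, t=2)
rhs = expanded_distance_sq(A_alpha, A_beta, 3, 7, t=2)
```

Because each matrix is symmetric, row $i$ of $\mathbf{A}^t$ equals the column $\mathbf{A}^t e^{(i)}$, which is what the sketch uses.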

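An expression similar to (B.3) can be had for the continuous diffusion distance. By Assumption 2,
$k_\alpha \in C_b^{d+1}(X \times X)$ and $k_\alpha \ge C_1(\alpha)$. These imply that $a_\alpha \in C_b^{d+1}(X \times X)$. We can then apply
Mercer's Theorem to get that

$$a_\alpha^{(t)}(x,y) = \sum_{\ell \ge 1} (\lambda_\alpha^{(\ell)})^t\, \psi_\alpha^{(\ell)}(x)\, \psi_\alpha^{(\ell)}(y), \quad \text{for all } (x,y) \in X \times X, \tag{B.4}$$

with absolute convergence and uniform convergence on compact subsets of $X$. In fact, since $A_\alpha$
is also trace class, we can get uniform convergence on all of $X$. Indeed,

$$\mathrm{Tr}(A_\alpha) = \sum_{\ell \ge 1} \lambda_\alpha^{(\ell)} < \infty.$$

Therefore, for all $\varepsilon > 0$ and for each $\alpha \in \mathcal{I}$, there exists $N(\varepsilon,\alpha) \in \mathbb{N}$ such that

$$\sum_{\ell > N(\varepsilon,\alpha)} \lambda_\alpha^{(\ell)} < \varepsilon.$$

Furthermore, since $a_\alpha$ is bounded, $\psi_\alpha^{(\ell)} \le C_1(\alpha)$ for all $\ell \ge 1$. Therefore,

$$\sum_{\ell > N(\varepsilon,\alpha)} (\lambda_\alpha^{(\ell)})^t\, \psi_\alpha^{(\ell)}(x)\, \psi_\alpha^{(\ell)}(y) \le C_2(\alpha) \sum_{\ell > N(\varepsilon,\alpha)} (\lambda_\alpha^{(\ell)})^t \le C_2(\alpha)\, \varepsilon, \quad \text{for all } (x,y) \in X \times X. \tag{B.5}$$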



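Now define a family of functions $\varphi_\alpha^{(N,i)} \in L^2(X,\mu)$ for all $N \in \mathbb{N}$ and $i \in \{1, \ldots, n\}$,

$$\varphi_\alpha^{(N,i)}(x) \triangleq \sum_{\ell=1}^{N} \psi_\alpha^{(\ell)}(x^{(i)})\, \psi_\alpha^{(\ell)}(x).$$

We claim that

$$\big| a_\alpha^{(t)}(x^{(i)}, x) - A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}(x) \big| \le C_2(\alpha)\, \varepsilon, \quad \text{for all } x \in X. \tag{B.6}$$

Indeed,

\begin{align}
A_\alpha^t \varphi_\alpha^{(N,i)}(x) &= \int_X a_\alpha^{(t)}(x,y)\, \varphi_\alpha^{(N,i)}(y)\, d\mu(y) \notag \\
&= \sum_{m \ge 1} \sum_{\ell=1}^{N} (\lambda_\alpha^{(m)})^t\, \psi_\alpha^{(m)}(x)\, \psi_\alpha^{(\ell)}(x^{(i)}) \int_X \psi_\alpha^{(m)}(y)\, \psi_\alpha^{(\ell)}(y)\, d\mu(y) \notag \\
&= \sum_{\ell=1}^{N} (\lambda_\alpha^{(\ell)})^t\, \psi_\alpha^{(\ell)}(x^{(i)})\, \psi_\alpha^{(\ell)}(x). \tag{B.7}
\end{align}

Therefore, using (B.4), (B.7), and (B.5), we obtain

$$\big| a_\alpha^{(t)}(x^{(i)}, x) - A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}(x) \big| = \Big| \sum_{\ell > N(\varepsilon,\alpha)} (\lambda_\alpha^{(\ell)})^t\, \psi_\alpha^{(\ell)}(x^{(i)})\, \psi_\alpha^{(\ell)}(x) \Big| \le C_2(\alpha)\, \varepsilon,$$

and so (B.6) holds.

Using (B.6), it is not hard to see that

$$\Big| D^{(t)}(x_\alpha^{(i)}, x_\beta^{(j)}) - \big\| A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)} - A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \big\|_{L^2(X,\mu)} \Big| \le C_3(\alpha,\beta)\, \varepsilon.$$

Thus it is enough to consider $\| A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)} - A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \|_{L^2(X,\mu)}$. Expanding the square of this
quantity one has

\begin{align}
\big\| A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)} - A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \big\|_{L^2(X,\mu)}^2
&= \langle A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \rangle_{L^2(X,\mu)} + \langle A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)}, A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{L^2(X,\mu)} \notag \\
&\quad - 2 \langle A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{L^2(X,\mu)}. \tag{B.8}
\end{align}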



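The three inner products in (B.3) correspond to the three inner products in (B.8). We aim to
show that each pair is nearly identical. We will do so explicitly for the pair $n^2 \langle \mathbf{A}_\alpha^t e^{(i)}, \mathbf{A}_\beta^t e^{(j)} \rangle_{\mathbb{R}^n}$
and $\langle A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{L^2(X,\mu)}$; the other two pairs are simply special cases of this one. We
begin with the discrete inner product, for which we have the following with probability $1 - 2e^{-\tau}$:

\begin{align}
n^2 \langle \mathbf{A}_\alpha^t e^{(i)}, \mathbf{A}_\beta^t e^{(j)} \rangle_{\mathbb{R}^n}
&= n^2 \langle (R_n E_{\alpha,n})^t e^{(i)}, (R_n E_{\beta,n})^t e^{(j)} \rangle_{\mathbb{R}^n} \tag{B.9} \\
&= n^2 \langle (E_{\alpha,n} R_n)^{t-1} E_{\alpha,n} e^{(i)}, R_n^* R_n (E_{\beta,n} R_n)^{t-1} E_{\beta,n} e^{(j)} \rangle_{\mathcal{H}^s} \notag \\
&= \langle A_{\alpha,n}^{t-1} a_{\alpha,n}(x^{(i)}, \cdot), T_n A_{\beta,n}^{t-1} a_{\beta,n}(x^{(j)}, \cdot) \rangle_{\mathcal{H}^s} \tag{B.10} \\
&\le \langle A_{\alpha,\mathcal{H}^s}^{t-1} a_{\alpha,n}(x^{(i)}, \cdot), T_{\mathcal{H}^s} A_{\beta,\mathcal{H}^s}^{t-1} a_{\beta,n}(x^{(j)}, \cdot) \rangle_{\mathcal{H}^s} + C(\alpha,\beta,d,t)\, \frac{\sqrt{\tau}}{\sqrt{n}}, \tag{B.11}
\end{align}

where (B.9) follows from (B.2), (B.10) follows from (B.2) and the definitions of $E_{\alpha,n}$ and $e^{(i)}$, and
(B.11) follows from Lemma Appendix B.1, Theorem Appendix B.2, Theorem Appendix B.3,
and the Cauchy-Schwarz inequality. Since the argument is symmetric, we have, with probability
$1 - 2e^{-\tau}$,

$$\Big| n^2 \langle \mathbf{A}_\alpha^t e^{(i)}, \mathbf{A}_\beta^t e^{(j)} \rangle_{\mathbb{R}^n} - \langle A_{\alpha,\mathcal{H}^s}^{t-1} a_{\alpha,n}(x^{(i)}, \cdot), T_{\mathcal{H}^s} A_{\beta,\mathcal{H}^s}^{t-1} a_{\beta,n}(x^{(j)}, \cdot) \rangle_{\mathcal{H}^s} \Big| \le C(\alpha,\beta,d,t)\, \frac{\sqrt{\tau}}{\sqrt{n}}. \tag{B.12}$$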
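Now return to the continuous inner product. With probability $1 - 2e^{-\tau}$, we have:

\begin{align}
\langle A_\alpha^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, A_\beta^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{L^2(X,\mu)}
&= \langle (R_{\mathcal{H}^s} E_{\alpha,\mathcal{H}^s})^t \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, (R_{\mathcal{H}^s} E_{\beta,\mathcal{H}^s})^t \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{L^2(X,\mu)} \tag{B.13} \\
&= \langle A_{\alpha,\mathcal{H}^s}^{t-1} E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)}, T_{\mathcal{H}^s} A_{\beta,\mathcal{H}^s}^{t-1} E_{\beta,\mathcal{H}^s} \varphi_\beta^{(N(\varepsilon,\beta),j)} \rangle_{\mathcal{H}^s}, \tag{B.14}
\end{align}

where (B.13) and (B.14) both follow from (B.2).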

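Examining (B.12) and (B.14), it is clear that to complete the proof we must bound the quantity
$\| a_{\alpha,n}(x^{(i)}, \cdot) - E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \|_{\mathcal{H}^s}$. We break it into two parts:

$$\big\| a_{\alpha,n}(x^{(i)}, \cdot) - E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \big\|_{\mathcal{H}^s} \le \big\| a_{\alpha,n}(x^{(i)}, \cdot) - a_\alpha(x^{(i)}, \cdot) \big\|_{\mathcal{H}^s} + \big\| a_\alpha(x^{(i)}, \cdot) - E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \big\|_{\mathcal{H}^s}. \tag{B.15}$$

For the first part, some simple manipulations give:

$$a_{\alpha,n}(x^{(i)}, x) - a_\alpha(x^{(i)}, x) = f_{\alpha,n}^{(i)}(x) + g_{\alpha,n}^{(i)}(x),$$

where

$$f_{\alpha,n}^{(i)}(x) \triangleq \frac{k_\alpha(x^{(i)}, x)\, \big( \sqrt{m_\alpha(x)} - \sqrt{m_{\alpha,n}(x)} \big)}{\sqrt{m_{\alpha,n}(x^{(i)})}\, \sqrt{m_{\alpha,n}(x)}\, \sqrt{m_\alpha(x)}}$$

and

$$g_{\alpha,n}^{(i)}(x) \triangleq \frac{k_\alpha(x^{(i)}, x)\, \big( \sqrt{m_\alpha(x^{(i)})} - \sqrt{m_{\alpha,n}(x^{(i)})} \big)}{\sqrt{m_{\alpha,n}(x^{(i)})}\, \sqrt{m_\alpha(x^{(i)})}\, \sqrt{m_\alpha(x)}}.$$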
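The splitting of $a_{\alpha,n}(x^{(i)}, x) - a_\alpha(x^{(i)}, x)$ into two quotients, one involving the density difference at $x$ and one involving the difference at $x^{(i)}$, is a purely algebraic identity and can be sanity-checked with arbitrary positive numbers. In the sketch below every array is a hypothetical stand-in (random values, not an actual kernel or density), and the quotients are written out explicitly in the code.

```python
import numpy as np

rng = np.random.default_rng(3)
size = 100
# Hypothetical positive values standing in for the kernel and the densities.
k_val = rng.uniform(0.5, 1.5, size=size)             # k_alpha(x_i, x)
m_x = rng.uniform(0.5, 1.5, size=size)               # m_alpha(x)
m_xi = rng.uniform(0.5, 1.5, size=size)              # m_alpha(x_i)
mn_x = m_x + rng.normal(0.0, 0.01, size=size)        # m_{alpha,n}(x)
mn_xi = m_xi + rng.normal(0.0, 0.01, size=size)      # m_{alpha,n}(x_i)

# Empirical and continuous symmetric kernels.
a_emp = k_val / (np.sqrt(mn_xi) * np.sqrt(mn_x))
a_cont = k_val / (np.sqrt(m_xi) * np.sqrt(m_x))

# First piece: carries the density difference at x.
f = (k_val * (np.sqrt(m_x) - np.sqrt(mn_x))
     / (np.sqrt(mn_xi) * np.sqrt(mn_x) * np.sqrt(m_x)))
# Second piece: carries the density difference at x_i.
g = (k_val * (np.sqrt(m_xi) - np.sqrt(mn_xi))
     / (np.sqrt(mn_xi) * np.sqrt(m_xi) * np.sqrt(m_x)))

decomposition_ok = np.allclose(a_emp - a_cont, f + g)
```

The identity comes from adding and subtracting the mixed term $k_\alpha(x^{(i)},x) / (\sqrt{m_{\alpha,n}(x^{(i)})}\sqrt{m_\alpha(x)})$.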
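For $f_{\alpha,n}^{(i)}$, using Lemma Appendix B.1 and Lemma Appendix B.4 it is
easy to see that $\| f_{\alpha,n}^{(i)} \|_{\mathcal{H}^s} \le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}}$ with probability $1 - 2e^{-\tau}$. For $g_{\alpha,n}^{(i)}$, note that

\begin{align}
| m_{\alpha,n}(x^{(i)}) - m_\alpha(x^{(i)}) | &\le \sup_{x \in X} | m_{\alpha,n}(x) - m_\alpha(x) | \notag \\
&\le \| m_{\alpha,n} - m_\alpha \|_{C_b^0(X)} \notag \\
&\le C(d)\, \| m_{\alpha,n} - m_\alpha \|_{\mathcal{H}^{d+1}} \notag \\
&\le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}}, \tag{B.16}
\end{align}

where in (B.16) we once again used Lemma Appendix B.4. Thus $\| g_{\alpha,n}^{(i)} \|_{\mathcal{H}^s} \le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}}$ with
probability $1 - 2e^{-\tau}$, and so we have bounded the first term on the right hand side of (B.15).

For the second term on the right hand side of (B.15), recall the definition of $\| \cdot \|_{\mathcal{H}^s}$. If we can bound
$\| \partial_x^\gamma a_\alpha(x^{(i)}, \cdot) - \partial_x^\gamma E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \|_{L^2(X,dx)}$, where $\gamma = 0$ (i.e., no derivative) or $|\gamma| = s$, then we will
have bounded this term as well. Note that $a_\alpha \in C_b^{d+1}(X \times X)$ implies that $\psi_\alpha^{(\ell)} \in C_b^s(X)$ for all
$\ell \ge 1$. Furthermore, the derivative $\partial_x^\gamma a_\alpha(x^{(i)}, \cdot)$ can be computed term by term from (B.4). Thus,
using nearly the same argument we used to show (B.6), one can show that

$$\big| \partial_x^\gamma a_\alpha(x^{(i)}, x) - \partial_x^\gamma E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)}(x) \big| \le C_4(\alpha)\, \varepsilon, \quad \text{for all } x \in X,\ |\gamma| \le s. \tag{B.17}$$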

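Using (B.17), we have:

$$\big\| \partial_x^\gamma a_\alpha(x^{(i)}, \cdot) - \partial_x^\gamma E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \big\|_{L^2(X,dx)} \le \sqrt{|X|}\; C_4(\alpha)\, \varepsilon,$$

where $|X|$ denotes the Lebesgue measure of $X$. Since $X$ was assumed to be bounded, we have
$|X| \le C$. Returning to (B.15), we have now shown that:

$$\big\| a_{\alpha,n}(x^{(i)}, \cdot) - E_{\alpha,\mathcal{H}^s} \varphi_\alpha^{(N(\varepsilon,\alpha),i)} \big\|_{\mathcal{H}^s} \le C(\alpha,d)\, \frac{\sqrt{\tau}}{\sqrt{n}} + C \varepsilon.$$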



Taking $\varepsilon = \sqrt{\tau}/\sqrt{n}$ completes the proof.
□

Appendix B.5. Proof of Theorem 5.2

Finally, we prove Theorem 5.2.



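Theorem Appendix B.6 (Theorem 5.2). Suppose that $(X,\mu)$ and $\{k_\alpha\}_{\alpha \in \mathcal{I}}$ satisfy the conditions
of Assumption 2. Let $n \in \mathbb{N}$ and sample $X_n = \{x^{(1)}, \ldots, x^{(n)}\} \subset X$ i.i.d. according to $\mu$; also let
$t \in \mathbb{N}$, $\tau > 0$, and $\alpha, \beta \in \mathcal{I}$. Then, with probability $1 - 2e^{-\tau}$,

$$\big| \mathcal{D}^{(t)}(\Gamma_\alpha, \Gamma_\beta) - \mathcal{D}_n^{(t)}(\Gamma_{\alpha,n}, \Gamma_{\beta,n}) \big| \le C(\alpha,\beta,d,t)\, \frac{\sqrt{\tau}}{\sqrt{n}}.$$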
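Proof. Recall that $\mathcal{D}^{(t)}(\Gamma_\alpha, \Gamma_\beta) = \| A_\alpha^t - A_\beta^t \|_{HS}$. From Proposition 13 in [15], we know that
$\lambda \in (0,1]$ is an eigenvalue of $A_\alpha$ if and only if it is an eigenvalue of $A_{\alpha,\mathcal{H}^s}$. Using the same
ideas, one can show that $\lambda \neq 0$ is an eigenvalue of $A_\alpha^t - A_\beta^t$ if and only if it is an eigenvalue of
$A_{\alpha,\mathcal{H}^s}^t - A_{\beta,\mathcal{H}^s}^t$. Therefore,

$$\| A_\alpha^t - A_\beta^t \|_{HS} = \| A_{\alpha,\mathcal{H}^s}^t - A_{\beta,\mathcal{H}^s}^t \|_{HS}.$$

Similarly, one can show that

$$\| \mathbf{A}_\alpha^t - \mathbf{A}_\beta^t \|_{HS} = \| A_{\alpha,n}^t - A_{\beta,n}^t \|_{HS}.$$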



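Thus, using the above and Theorem Appendix B.3 we have, with probability $1 - 2e^{-\tau}$,

\begin{align}
\mathcal{D}^{(t)}(\Gamma_\alpha, \Gamma_\beta) &= \| A_{\alpha,\mathcal{H}^s}^t - A_{\beta,\mathcal{H}^s}^t \|_{HS} \notag \\
&\le \| A_{\alpha,n}^t - A_{\beta,n}^t \|_{HS} + \| A_{\alpha,\mathcal{H}^s}^t - A_{\alpha,n}^t \|_{HS} + \| A_{\beta,\mathcal{H}^s}^t - A_{\beta,n}^t \|_{HS} \notag \\
&\le \mathcal{D}_n^{(t)}(\Gamma_{\alpha,n}, \Gamma_{\beta,n}) + C(\alpha,\beta,d,t)\, \frac{\sqrt{\tau}}{\sqrt{n}}. \notag
\end{align}

Since the argument is symmetric, we get the desired inequality.

□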
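For intuition about the quantity bounded above, the following minimal sketch computes a Hilbert-Schmidt (Frobenius) norm of the difference of matrix powers on one shared sample and checks the triangle inequality of the kind used in the proof. It rests on assumptions: Gaussian kernels with three hypothetical bandwidths play the role of three parameter values, the matrices are normalized as $\mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$ with $\mathbf{K}[i,j] = k(x^{(i)},x^{(j)})/n$, and any additional scaling in the paper's definition of the empirical global diffusion distance is ignored.

```python
import numpy as np

def diffusion_matrix(samples, eps):
    """A = D^{-1/2} K D^{-1/2} with K[i, j] = k(x_i, x_j) / n (assumed)."""
    n = len(samples)
    K = np.exp(-(samples[:, None] - samples[None, :]) ** 2 / eps) / n
    deg = K.sum(axis=1)
    return K / np.sqrt(np.outer(deg, deg))

def global_distance(A_a, A_b, t):
    """Frobenius (Hilbert-Schmidt) norm of A_a^t - A_b^t."""
    diff = np.linalg.matrix_power(A_a, t) - np.linalg.matrix_power(A_b, t)
    return float(np.linalg.norm(diff, "fro"))

rng = np.random.default_rng(4)
samples = rng.uniform(0.0, 1.0, size=60)
# Three hypothetical parameter values, realized as kernel bandwidths.
A1 = diffusion_matrix(samples, eps=0.1)
A2 = diffusion_matrix(samples, eps=0.2)
A3 = diffusion_matrix(samples, eps=0.4)

t = 3
d12 = global_distance(A1, A2, t)
d23 = global_distance(A2, A3, t)
d13 = global_distance(A1, A3, t)
triangle_ok = d13 <= d12 + d23 + 1e-12
```

The same triangle inequality, applied with the continuous operators as the intermediate points, is what produces the three-term bound in the proof.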